How to Run SQL Queries on Pandas Data Frames?
Pandas may be the de facto standard library for data analysis in Python. Yet, most data scientists are more comfortable with SQL than with pandas data operations.
At times, things simply feel easier to do in SQL. Wouldn't it be great if there were a way to get the best of both worlds?
There is.
A wonderful library in the Python ecosystem makes it possible.
Pandasql is an open-source package that lets you run SQL code on pandas data frames. You can check out its repository on GitHub.
You can install Pandasql from PyPI as follows.
pip install pandasql
Once you've installed it, you can use its sqldf function to query any data frame in memory.
from pandasql import sqldf
from sklearn import datasets
data = datasets.load_wine(as_frame=True)['data']
sqldf("SELECT * FROM data LIMIT 10", globals())
In the above example, we've imported the wine recognition dataset from scikit-learn's toy dataset collection. We then queried the first ten records from the dataset.
The first argument to the sqldf function is straightforward to understand: it's a SQL query. The second argument, globals(), tells the function to use the global context to find the data frame. The other possible value for this argument is locals(), which tells the interpreter to look only within the context of a function block.
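Here's a minimal sketch of the locals() variant; the function and the toy frame are made up purely for illustration.
from pandasql import sqldf
import pandas as pd

def top_rows():
    # This frame exists only in the function's local scope.
    local_df = pd.DataFrame({'x': [1, 2, 3]})
    # Passing locals() lets sqldf find local_df inside this function.
    return sqldf("SELECT * FROM local_df LIMIT 2", locals())

print(top_rows())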
If we were to fetch those first ten records using only pandas operations, it would look as follows.
data[:10]
Both operations produce the same result.
At first, it may seem counterintuitive to add complexity to the trivial task of querying the first ten records. But as queries grow, they become easier to express in SQL than with chained pandas operations.
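For instance, here's a rough sketch of the same grouped aggregation written both ways, using the wine data frame loaded above; the column choices are arbitrary.
# SQL version: average alcohol per proline level, keeping groups
# with more than one row.
sql_result = sqldf("""
SELECT proline, AVG(alcohol) AS avg_alcohol, COUNT(*) AS n
FROM data
GROUP BY proline
HAVING COUNT(*) > 1
""", globals())

# The equivalent pandas chain.
pandas_result = (
    data.groupby('proline')
        .agg(avg_alcohol=('alcohol', 'mean'), n=('alcohol', 'size'))
        .query('n > 1')
        .reset_index()
)
The SQL version reads top to bottom almost like a sentence, while the pandas chain expects you to know several method names and their quirks.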
PRO TIP: Query pandas using SQL without having to mention the global environment.
We can use the functools.partial utility to create a partial function that doesn't require you to mention the environment every time. Here's how to do it.
from functools import partial
gpandsql = partial(sqldf, env=globals())
gpandsql("SELECT * FROM data LIMIT 10")
Pandasql uses SQLite as its temporary backend, so you can perform all kinds of SQL operations SQLite supports. This includes GROUP BY, WHERE, and the various JOINs.
Here's an example:
from functools import partial
from pandasql import sqldf
from sklearn import datasets

wine = datasets.load_wine(as_frame=True)
data = wine['data'].copy()
target = wine['target'].to_frame()

# Give both frames an explicit row id so the LEFT JOIN has a key to
# match on; without an ON clause, SQLite treats a LEFT JOIN as a cross join.
data['row_id'] = range(len(data))
target['row_id'] = range(len(target))

gpandsql = partial(sqldf, env=globals())

gpandsql("""
SELECT *
FROM data d
LEFT JOIN target t ON d.row_id = t.row_id
WHERE d.ash = 1.36
LIMIT 5
""").describe()
You can even perform more advanced operations such as window functions.
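For example, here's a minimal sketch of a ranking query. It assumes the data frame and the gpandsql helper defined above, and a Python build whose bundled SQLite is version 3.25 or newer, the first release with window-function support.
# Rank the wines by alcohol content using a window function.
gpandsql("""
SELECT
    alcohol,
    ash,
    RANK() OVER (ORDER BY alcohol DESC) AS alcohol_rank
FROM data
LIMIT 5
""")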
However, SQLite is not designed for large-scale workloads, and it has other limitations we need to be aware of.
Also, note that we can only query data frames with Pandasql. We cannot create, update, or delete records using this library.
Most data scientists are at home in SQL databases, and for some of them, pandas operations don't come easily.
For them, Pandasql is a fantastic tool to enjoy the best of both worlds.
I've also noticed that SQL queries are excellent for readability. They follow a natural language flow, making them easy for even novice programmers to grasp.