Python and SQL

After working on some Javascript issues, I wanted to try a new problem on Python and luckily, in one of the issues of the project I was working on was on Python and SQL.

Taking a look at the issue:

The issue can be found here.

To solve the issue, I need to:

learn more about Python in general

learn about sqlalchemy

learn more about SQL injection

learn how to use sqlalchemy to prevent SQL injection

Approaching the issue:

Before trying to solve the problem, I took a quick look at the file leads.py to figure out its usage. It acts as an api route "/leads" that serves data or creates new data.

Moreover, what is sqlalchemy? Simply put, is a Python Object Oriented Mapper that allows developers to work with SQL, it hides and encapsulates SQL queries into objects.

SQL Injection can happen when you take user's input, put it into raw string SQL statements. For example:

query = text(
            """
            SELECT
                {columns}
            FROM lead_tag lt
            JOIN leads l on l.id = lt.lead_id
            JOIN tags t on t.id = lt.tag_id
            WHERE to_tsvector(l.company_name) @@ to_tsquery('{search}')
                AND lt.tag_id = :tag_id
            ORDER BY ts_rank(to_tsvector(company_name), '{search}')
            LIMIT :limit
            OFFSET :offset
        """.format(
                columns=",".join("l." + f for f in DEFAULT_LEAD_FIELDS),
                search=search,
            )
        )

The piece of code uses string interpolation to add user's input into the SQL statement which is vulnerable to SQL injection, attackers can easily insert a SQL statement that can change your tables. Therefore, we need to convert this piece of code into sqlalchemy which can turn this query into object and protect our server from injection.

Bonus: a very helpful article on SQL injection

Coming up with a solution

There are many ways to convert raw SQL statement into sqlalchemy expressions, I tried to change as little as possible by working around with the existing code res = connection.execute(query, **query_args)

with db.get_connection() as connection:
        res = connection.execute(query, **query_args)
        response_body = []
        count = 0
        for row in res:
            # TODO: handle potential errors if the user chooses a field not in the row
            lead = {field: getattr(row, field) for field in include}
            if drop_null:
                lead = {k: v for (k, v) in lead.items() if v is not None}
            response_body.append(lead)
            count += 1
        return {
            "count": count,
            "query": {
                "page": page,
                "perpage": perpage,
            },
            "leads": response_body,
        }, 200

All I needed to do was to change the search interpolation into bound parameters similar to LIMIT and OFFSET.

query = text(
            """
            SELECT
                {columns}
            FROM lead_tag lt
            JOIN leads l on l.id = lt.lead_id
            JOIN tags t on t.id = lt.tag_id
            WHERE to_tsvector(l.company_name) @@ to_tsquery(:search)
                AND lt.tag_id = :tag_id
            ORDER BY ts_rank(to_tsvector(company_name), :search)
            LIMIT :limit
            OFFSET :offset
        """.format(
                columns=",".join("l." + f for f in DEFAULT_LEAD_FIELDS),
            )
        )
        query_args = {
            "limit": limit,
            "offset": offset,
            "tag_id": tag_id,
            "search": search,
        }

The SQL statement will now be executed using execute method with its arguments query_args, if injection occurs, it will auto-escape.

Conclusion

I love this issue, it allowed me to learn new things (Python, SQL injection, ORM), in the end I felt very satisfied. Though, the issue seems easy, it took me about 2 days to let all the new information sink in before attempting the PR.