Efficient Django Delete Cascade

Background

I'm working on a project that is developed using django. Part of the is project requires deleting n-level-deep relational data with a huge quantity of detail records at the lowest level.

Django allows for the setting of on-delete actions in the ForeignKey instantiation. Unfortunately, the on-delete action is not set in the database. It turns out that django is designed to handle cascade actions in the ORM at the application level. What this means is that it can use up a lot of memory resources as it walks relations, instantiates objects and deletes them through its ORM functionality.

Our project is deployed in Openshift. This ORM activity was causing the pod to be halted as it OOM'd. This was a drag for everyone and was causing data processing operations to back up.

So, what to do about it? I decided to not worry about django messaging at the lower cascade levels and instead manage the cascade myself. This required the development of some utility functions to compile QuerySet instances into SQL and another function to recursively walk model relations. The idea was to have the relation walker recursively reach the lowest relations level, then execute SQL directly on its way back up. Also, there is no query execution until the recursion works its way back. The benefit here is that we can use the ORM to construct a top-level query for records to delete, then the cascade function will find all of the models that depend on records from the target and execute the SQL necessary to simulate the delete cascade functionality that a database like PostgreSQL can handle.

This design was intended to be a compromise for ease of use in our application and not have to write a complicated django application extension or change the entire applications model inheritance to use a subclass of models.Model.

Design

Utility Functions

This required the creation of small set of utility functions to handle compilation of a QuerySet instance into either DELETE or UPDATE SQL statements. Also, a common function was needed to directly execute the compiled SQL with parameters

The delete SQL compiler:

from django.db import transaction
from django.db.models.sql.compiler import SQLDeleteCompiler


def get_delete_sql(query):
    """
    Compile a DELTE SQL statement from a QuerySet instance
    Params:
        query (QuerySet) : The query to compile to SQL
    Returns (tuple):
        (sql, params) : sql is the sql statement to exeucte
                        params are any parameters for the statement to use. This is a tuple.
    """
    return SQLDeleteCompiler(query.query, transaction.get_connection(), query.db).as_sql()

The update SQL compiler:

from django.db import models


def get_update_sql(query, **updatespec):
    """
    Compile the query with the update specifications into an UPDATE SQL statement and parameters
    Params:
        query (QuerySet) : The QuerySet that will select the row(s) to update
        updatespec (dict) : {column: new_value} expressed in the function call as `column=new_value` named parameters
    Returns (tuple):
        (sql, params) : sql is the sql statement to exeucte
                        params are any parameters for the statement to use. This is a tuple.
    """
    assert query.query.can_filter()
    query.for_write = True
    q = query.query.chain(models.sql.UpdateQuery)
    q.add_update_values(updatespec)
    q._annotations = None

    return q.get_compiler(query.db).as_sql()

The executor:

import logging

from django.db import transaction
from sqlparse import format as format_sql


LOG = logging.getLogger(__name__)


def execute_compiled_sql(sql, params=None):
    """
    Execute the SQL with any parameters directly using connection.cursor()
    Params:
        sql (str) : The parameterized SQL statement.
        params (tuple/list) : Parameters for the sql statement or None
    Returns :
        int : rows affected by the statement
    """
    rows_affected = 0
    with transaction.get_connection().cursor() as cur:
        params = params or None
        LOG.debug(format_sql(cur.mogrify(sql, params).decode("utf-8"), reindent_aligned=True))
        cur.execute(sql, params)
        rows_affected = cur.rowcount

    return rows_affected

This executor function will log the actual sql statement (with parameters inline) at the DEBUG level.

Cascade Function

The cascade function will walk all relations defined in the django model and will directly execute CASCADE and SET_NULL on-delete actions. Anything else is passed to the database and will return any database exception needed on error.

To call this function, all you need is the model from which you wish to start the cascade actions and a query against that model that will return the row(s) you wish to delete.

def cascade_delete(from_model, instance_pk_query, skip_relations=[], base_model=None, level=0):
    """
    Performs a cascading delete by walking the Django model relations and executing compiled SQL
    to perform the on_delete actions instead or running the collector.
    Parameters:
        from_model (models.Model) : A model class that is the relation root
        instance_pk_query (QuerySet) : A query for the records to delete and cascade from
        base_model (None; Model) : The root model class, If null, this will be set for you.
        level (int) : Recursion depth. This is used in logging only. Do not set.
        skip_relations (Iterable of Models) : Relations to skip over in case they are handled explicitly elsewhere
    """
    if base_model is None:
        base_model = from_model
    instance_pk_query = instance_pk_query.values_list("pk").order_by()
    LOG.info(f"Level {level} Delete Cascade for {base_model.__name__}: Checking relations for {from_model.__name__}")
    for model_relation in from_model._meta.related_objects:
        related_model = model_relation.related_model
        if related_model in skip_relations:
            LOG.info(f"SKIPPING RELATION {related_model.__name__} from caller directive")
            continue

        if model_relation.on_delete.__name__ == "SET_NULL":
            filterspec = {f"{model_relation.remote_field.column}__in": models.Subquery(instance_pk_query)}
            updatespec = {f"{model_relation.remote_field.column}": None}
            LOG.info(
                f"    Executing SET NULL constraint action on {related_model.__name__}"
                f" relation of {from_model.__name__}"
            )
            rec_count = execute_update_sql(related_model.objects.filter(**filterspec), **updatespec)
            LOG.info(f"    Updated {rec_count} records in {related_model.__name__}")
        elif model_relation.on_delete.__name__ == "CASCADE":
            filterspec = {f"{model_relation.remote_field.column}__in": models.Subquery(instance_pk_query)}
            related_pk_values = related_model.objects.filter(**filterspec).values_list(related_model._meta.pk.name)
            LOG.info(f"    Cascading delete to relations of {related_model.__name__}")
            cascade_delete(
                related_model, related_pk_values, base_model=base_model, level=level + 1, skip_relations=skip_relations
            )

    LOG.info(f"Level {level}: delete records from {from_model.__name__}")
    if level == 0:
        del_query = instance_pk_query
    else:
        filterspec = {f"{from_model._meta.pk.name}__in": models.Subquery(instance_pk_query)}
        del_query = from_model.objects.filter(**filterspec)

    rec_count = execute_delete_sql(del_query)
    LOG.info(f"Deleted {rec_count} records from {from_model.__name__}")

Note the skip_relations override. This is necessary for special cases that are handled outside of the cascade process and would either be unnecessary work or would possibly introduce an error. For my project, it was to skip partitioned tables where the delete involved simply dropping a partition. This would be unnecessary work and would be inefficient since all partitions could be scanned.

So the cascade will be handled appropriately as it checks all relations at every level and executes deletes bottom-up. This allows us to use the existing model relations and allows us to use the ORM QuerySet to build queries as normal.

I hope that you will find this useful if you are also using django and need to delete large amounts of related data via cascading actions.

Addendum

The project I'm working on makes heavy use of deferred constraints. If you are using this cascade delete and your database has deferred constraints, you may want to set them to immediate for this transaction.

19