Improving performance with SQL aggregate functions
In this article, you will learn how SQL aggregate functions can represent an easy way to significantly improve your application's performance. Mainly, you will see how they were a game-changer in a real-world scenario based on a data-driven application developed for a startup operating in the sports industry.
Let's now delve deeper into this scenario and learn why you can't ignore SQL aggregate functions in data science.
The application I recently worked on aims to offer advanced data exploration features in the sports world through the web. In particular, it needs to allow exploration of both raw and aggregated data. Since the database involves terabytes of heterogeneous and unstructured data, the challenges were mostly on the backend and database side. Now, let's dive into this scenario.
We developed the backend in Kotlin with the Spring Boot 2.5.3 framework and the Hibernate 5.4.32.Final ORM (Object Relational Mapping). We deployed it on an 8GB 4 CPU VPS through a Docker container managed by Dokku. The initial heap size was set to 2GB and limited to 7GB, while we allocated the remaining GB of RAM to a Redis-based caching system. We built the web application with performance in mind. Specifically, it's based on the multi-layered Spring Boot architecture described here and involves multi-thread processing.
We implemented the database as a MySQL server running on an 8GB 2 CPU VPS. We hosted the backend application and the database in the same server farm, but they do not share the same VPS. Since the sports data is simple but highly heterogeneous, the database was structured to avoid duplication and encourage standardization. This structure is why we chose a relational database. As it stands, the database involves hundreds of tables, and I cannot present it entirely here due to an NDA.
Luckily, the most problematic tables share more or less the same structure. So, analyzing just one table should be enough. In particular, this is what the PositionalData table looks like:
CREATE TABLE `PositionalData` ( `id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY, `area1` double DEFAULT NULL, `area2` double DEFAULT NULL, `area3` double DEFAULT NULL, `area4` double DEFAULT NULL, `area5` double DEFAULT NULL, ... `area140` double DEFAULT NULL, `area141` double DEFAULT NULL, `area142` double DEFAULT NULL, `area143` double DEFAULT NULL, `area144` double DEFAULT NULL, `value` double DEFAULT NULL, `parameterId` int(11) NOT NULL, `gameId` int(11) NOT NULL, `createdAt` datetime DEFAULT CURRENT_TIMESTAMP, `createdBy` int(11) DEFAULT NULL, `updatedAt` datetime DEFAULT CURRENT_TIMESTAMP, `updatedBy` int(11) DEFAULT NULL, `deletedAt` datetime DEFAULT NULL, `deletedBy` int(11) DEFAULT NULL, `active` tinyint(1) DEFAULT '1', ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
As you can see, it involves more than 100 columns, and it has more than four external IDs. On average, each of these tables contains at least 15 million rows.
One of the critical features of the frontend application is to let users analyze the aggregated values of hundreds of different sport parameters (e.g., passes, throws, blocks) coming from all the selected games of one or more seasons. We developed a backend API to perform a query on the table mentioned earlier to retrieve the data. Such a query was nothing more than a trivial SELECT returning from 10k to 20k rows. Then, this data is aggregated with a multi-thread process, stored in the Redis cache, and finally serialized in JSON and returned to the frontend application. From the first moment that the API receives a hit (and thus, before the result is available in the Redis cache) to completion, users must wait between two to four seconds.
This delay was unacceptable.
Let's now see what are the downsides of the approach just presented.
Most advanced ORMs abstract how they represent data at the database level. In other terms, the ORM performs the query, retrieves the desired data from the database, and takes care of transforming it into its application-level representation. This data transformation process happens behind the scene, but it undoubtedly represents an overhead. Although that process is usually negligible in terms of performance, it can quickly become a bottleneck for thousands of rows.
This slowdown is especially likely when using OO (Object Oriented) languages. Additionally, creating a new class instance takes time and resources. One way to limit the object size and heap usage might be to select only the strictly necessary set of columns. This approach would make each object lighter, even though the object creation process represents the main overhead. So the time spent performing this transformation process would not change significantly.
Performing simple operations like sum or average on arrays of objects containing thousands of elements is not performance-free. Although this does not compare to the time spent by the ORM to transform the data, it indeed represents an additional overhead. Fortunately, Java supports many thread-safe collections to perform operations concurrently. On the other hand, opening and managing threads are complex and time-consuming tasks.
Let's see how several SQL aggregate functions helped me solve the performance issue.
SQL aggregate functions allow you to calculate several rows and obtain one value as a result. Even though each SQL language has its own aggregate function, the most common ones are:
- COUNT(): returns a count of the number of rows selected
- MIN(): extracts the minimum value
- MAX(): extracts the maximum value
- SUM(): performs the sum operation
- AVG(): performs the average operation
They represent a potent and helpful tool when associated with the GROUP BY statement. Thanks to it, you can first group the desired data and then aggregate it by harnessing them. If you want to delve into MySQL aggregate functions, you can find all the supported ones here. I also recommend checking out this and this.
While SQL aggregation functions seemed promising, I did not know if they could make a difference before seeing them in action. Specifically, the application-level operation generated a data structure containing the average value on the value column and the sum of each areaX (with X from 1 to 144) column on each parameter chosen over the selected games. You can easily represent this in the following query:
SELECT SUM(`area1`) as `area1`, SUM(`area2`) as `area2`, SUM(`area3`) as `area3`, ... SUM(`area142`) as `area142`, SUM(`area143`) as `area143`, SUM(`area144`) as `area144`, AVG(`total`) as `total`, `parameterId` FROM `PositionalData` WHERE `parameterId` IN (:parameterIds) AND `gameId` IN (:gameIds) GROUP BY `parameterId`
As you can see, this query takes advantage of the SQL aggregate functions to return aggregate data at the database level. All this while filtering over the desired data using the IN statement on
parameterId and grouping it based on the same
parameterId. In other words, data is first filtered based on the selected game of the season and the desired parameters to analyze. Then, the resulting information is grouped by parameter and aggregated by the SQL aggregate functions.
Since that query involves GROUP BY, IN, and SQL aggregate statements, it might be slow. This potential slowness is why defining the proper indexes is so essential. In detail, the most critical and performance-effective index applied was the following one:
ALTER TABLE `PositionalData` ADD INDEX `PositionalData_parameterId_gameId` (`parameterId`, `gameId`) USING BTREE;
So should you always use aggregate functions? Some positives and negatives with this approach.
- Database-level aggregation is much faster than executing the same aggregation logic at the application level while looping on large arrays.
- Using queries involving SQL aggregate functions with the GROUP BY statement allows you to reduce the number of rows returned drastically. In detail, this let me go from about 10k rows to the number of rows equal to the number of parameters analyzed. Consequently, this makes the data transformation process performed by the ORM irrelevant in terms of time, preventing it from being a bottleneck.
- Aggregating at the database level allows you to take advantage of the performance benefits from the database cache when identical requests run. This setup can make it less crucial to have an application-level cache, leading to a lighter architecture.
- SQL aggregate functions run at SELECT time. When dealing with strongly typed programming languages, the ORM needs to know the type the result will have. And not all the ORMs allow you to define easily, sometimes even restricting SQL aggregate functions to native queries only. This reality means losing the advantage of abstraction introduced by the ORM and discourages their use.
- The query to extract the desired data involving SQL aggregate functions is always slower than a SELECT involving a simple WHERE clause. Nevertheless, the execution time should remain in the order of tenths of a second, and in any case, much less than to perform the same operation at the application level.
- The SQL aggregate operations available are usually limited to a dozen, of which only 5 or 6 are mathematical operations.
Let's compare the result in response time when calling the same API involving data aggregation with no cache and the same parameters.
- Response time when performing aggregation at the application level: ~2-4s
- Response time when performing aggregation at the database level: ~800ms
SQL aggregate functions are undoubtedly a great tool to take the performance to the next level when dealing with data science. Using them is easy and effective, although not all the ORM can fully or natively support them. Either way, knowing how to take advantage of them may become essential to improve performance, and explaining it through a real-world case study was why I wrote this article!