Boost Secondary Index Queries with Index Only Scan

Originally published at https://blog.yugabyte.com with YugabyteDB in mind, because the benefit of Index Only Scan is huge in a distributed database. It also applies to PostgreSQL, knowing that the hops to the table pages are saved only for the vacuumed part of the table.

A distributed SQL database reads from remote nodes, which increases the need for optimal data access. From my 20 years of experience as a database consultant, here is the most overlooked optimization technique for any database that supports it: performance-critical queries should find their columns from the index structure, without making extra hops to the table.

Many myths against SQL, such as “joins don’t scale” or “analytic queries need a separate column store”, have their source in bad index design. Contrary to a commonly used analogy, a database index is not like the index in a book. When you search for a keyword at the end of a book, an index entry will probably send you to one main page or, maybe two or three additional ones if you are not lucky. Only the words with small occurrences make their way to the book index, as it would be inefficient to list hundreds of pages for the most popular words. That’s different in a database, where we don’t skip any value.

With the exception of partial indexes, all table values from the indexed columns have their index entries. In SQL, you are allowed to query a range of values for which you will retrieve hundreds or thousands of rows. The index-to-table pages method (“Index Scan”), which is good enough for a few rows, is inefficient for a large range. But when the only alternative is a full scan (“Seq Scan”) on a big table, there’s no good access path offered to the query planner.

Fortunately, the query planner can come with a solution (“Index Only Scan”), but you need to provide the right index definition. In this blog post, we’ll explore how “Index Only Scan” can boost secondary index queries in a distributed SQL database.

Index Scan: a real-life example

Here is the classical CUSTOMER – ORDERS schema for a company selling coffee capsules, online and in local shops:

create table customers (
 customer_id     bigint constraint cust_pk primary key,
 customer_name   text   constraint cust_uk unique
);
create table orders (
 order_id bigint primary key,
 customer_id bigint references customers,
 order_date date,
 order_amount decimal
);

For welcoming loyal customers coming to the shop, the cashier — as soon as the customer identification is scanned (I’ll use the customer_name here for simplicity, even if we know it is not the best fit for a unique index) — gets an indication of the amount bought in the past 10 years. Here is the query:

select sum(ord.order_amount)
from customers cus join orders ord on cus.customer_id=ord.customer_id
where cus.customer_name='George Clooney'
and order_date > now() - interval '10 years';

The predicate on “customer_name” will use the index on it (implicitly created by the unique constraint here) and that’s efficient as it retrieves only one row. Obviously you don’t want a full table scan on ORDERS, as this is about one customer out of millions, and you have created an index on the foreign key:

create index ord_cust_id on orders(customer_id);

This is what most people do, and is often generated by a data modeling tool or JPA auto-ddl, from the foreign key definition. Developers think it is the right index because it is used by the query. Here is the explain(verbose) of it:

 Aggregate
   Output: sum(ord.order_amount)
   ->  Nested Loop       Output: ord.order_amount
       ->  Index Scan using cust_uk on public.customers cus
             Output: cus.customer_id, cus.customer_identification
             Index Cond: (cus.customer_identification = 'George Clooney'::text)
       ->  Index Scan using ord_cust_id on public.orders ord
             Output: ord.order_id, ord.customer_id, ord.order_date, ord.order_amount
             Index Cond: (ord.customer_id = cus.customer_id)
             Filter: (ord.order_date > (now() - '10 years'::interval))

When this is slow, developers may try to remove the join, thinking the slowness holds there. But the join is not a problem: a one-to-many join is a single iteration in the nested loop outer table. The problem is in the details of the inner table access. You see an index access, and that’s good. But don’t stop there. Is this index used efficiently?

Moving to Index Only Scan

Now, let’s think about how data is stored. This company has millions of customers. They order capsules regularly, but not frequently. Let’s say this loyal customer places an order every month. That means you’ll have to fetch 120 rows. That doesn’t seem like a lot, but think about where the rows are stored. They are scattered within the whole table because they arrived through those 10 years, interleaved with a load of millions of other orders.

If the rows were cached in memory, that would be fast. But storing 10 years of order items in RAM, just in case one of those customers comes up to the shop, would not be cost efficient. Those rows will be read from disks. And in a distributed SQL database, they may have to be fetched from multiple nodes.

This is a case — with shops in many countries — where geo-distribution makes sense. The latency adds-up, and even if in milliseconds, hundreds of them bring the response time over a second. From a business perspective, we would like this information to come fast for the most loyal customers, but the best customers are also those with a lot of orders.

You may think that you need to replicate this information to another database service dedicated to this analytic purpose. But this adds complexity, more code, and additional costs. In a relational database, data redundancy for performance purposes is a built-in automated feature provided by indexes. Luckily, all you need to do is define the right index for this. The solution is simple and has many names: “covering index”, “include index”, “projection index”, “fat index” and even “Tapio index” from the name of the author of “Interscience Relational Database Index Design and the Optimizers” (Tapio Lahdenmäki) who explained this in detail.

The SQL syntax is pretty simple:

create index ord_cust_new on orders(customer_id, order_date desc)
 include (order_amount);

That’s it. The columns have been added in two places in the CREATE INDEX statement. You add more columns to the index, and then you define which of them are used for access or for projection only. Here, “customer_id” and “order_date” are used to filter a range of rows based on the condition in the WHERE clause, so they are both required in the index key definition. As there’s a date range predicate on the date, it better fits in the index key than in the additional included columns. On the other hand, the “order_amount”, which is used only for the projection in the SELECT clause, doesn’t need to be part of the key, and including it out of the index key reduces the overhead on index maintenance.

Here is the subtle difference in the explain plan:

 Aggregate
   Output: sum(ord.order_amount)
   ->  Nested Loop
      Output: ord.order_amount
      ->  Index Scan using cust_uk on public.customers cus
            Output: cus.customer_id, cus.customer_identification
            Index Cond: (cus.customer_identification = 'George Clooney'::text)
      ->  Index Only Scan using ord_cust_new on public.orders ord
            Output: ord.customer_id, ord.order_date, ord.order_amount
            Index Cond: ((ord.customer_id = cus.customer_id) AND (ord.order_date > 
(now() - '10 years'::interval)))

As you can see, “Index Only Scan” has replaced “Index Scan”. This means that we have all information from the index, without having to fetch the many table rows that are scattered into the multiple nodes and disks. You see the whole WHERE clause predicate in “Index Condition”, and you know that all columns in “Output” were available from the index because of the “Index Only” access.

A larger index, but not a new one

It is important to note that we’ve created another index here, to show this tuning method, and that the query planner will be choosing it. But this new index should replace the other. As long as the previous columns stay first in the key column list, adding new columns adds more access paths, but still serves the previous ones. This means that in our example I can:

drop index ord_cust_id;

And a query on “customer_id” only will use the new index:

explain select count(*) from orders where customer_id=1;
                                      QUERY PLAN
---------------------------------------------------------------------
 Aggregate  (cost=15.50..15.51 rows=1 width=8)
   ->  Index Only Scan using ord_cust_new on orders  (cost=0.00..15.25 
rows=100 width=0)
      Index Cond: (customer_id = 1)

By keeping the same number of indexes, the insert and delete operations are not impacted, and the same access paths are allowed. This technique should not bring regression on other queries as long as the leading part of the indexed columns are the same as before, especially when the additional columns are not updated frequently. The order of columns in the “include” list does not matter, so one fat index can cover multiple queries.

When to use Index Only Scan

Now, you may have heard some myths about fat indexes. One is about the overhead to maintain them when rows are inserted, deleted or updated. But you need to think about it in context. For INSERT and DELETE, you still have the same index maintenance because you need the index anyway. This one just has more columns. For UPDATE, the index maintenance occurs when you change the indexed column value. In our example, the ORDERS amount will not change once inserted, so this concern is irrelevant. We’re using this optimization because the value is not changed frequently and is queried for many rows.

Another myth is about the size of the index. Yes, it is larger. But, again, think in context. The few additional bytes per row on disk is not a problem here. The size matters when it comes to memory because memory on a node is limited. And, yes, those index entries take more space in the cache. But think about what happens with the minimal index I had before: each query was bringing into memory hundreds of table rows, with all their columns, for non-frequent usage. There’s definitely a huge improvement on cache usage with Index Only Scan.

Of course, you will not inflate all indexes to cover all your queries. First, you need to focus on the performance-critical ones where you want to keep single-digit millisecond response time. Then, this is not required when reading one or two rows. Fat indexes are for range scans that have to read hundreds or thousands of rows. And then, you still have the agility to decide which columns are worth adding to the index key or as an included column. Columns where selective filters are applied are the most important ones because, even if you have to go to the table, you will have less rows to read from it. This is why I’ve added the “order_date” in this example: no need to read orders older than 10 years.

In addition, rather than just listing it in “include”, I’ve added it to the index key as a range (asc/desc) so that this filtering is done by the index access. Another advantage is that rows are returned already sorted, so it can help for queries that display all orders rather than the sum. A good index should help for many queries.

Beyond the filtering, you want to avoid table access when reading a lot of rows from a table. This is where you include in the index all the columns used by the query. The latter is an additional optimization for performance critical queries. With a complex data model serving multiple use-cases, it is important to understand these nuances between minimal index and indexes covering all queries, as a balance between accelerating reads without penalizing updates.

This also means that “SELECT *” is almost always a bad idea. You should only select the columns that you will need in order to benefit from Index Only Scan. And this recommendation is not only for SELECT. An UPDATE that lists all the table columns will be inefficient. This is why you need to set dynamic-update=true in Hibernate, which is unfortunately not the default.

Covering functions in the index

So far, the index examples we have used are based on columns, but YugabyteDB — an open source, distributed SQL database for transactional applications — also supports functional or expression based indexes. Let’s say we have a query like this:

select sum(ord.order_amount)
from customers cus join orders ord on cus.customer_id=ord.customer_id
where cus.customer_name='George Clooney'
and extract(dow from order_date)=0 /* Sunday */;

And define an index on (customer_id, (extract(dow from order_date))) to get fast access to those rows. To change it to a covering index, it is not sufficient to add include (order_amount) as I did above. We use the PostgreSQL query planner that doesn’t analyze — at optimization time — which columns are used in the expression. Then, if you want the expression to be covered, you need to add all columns that are used by the function. Here is the index that covers the query:

create index ord_cust_new
 on orders(customer_id, (extract(dow from order_date)))
 include (order_amount, order_date);

The execution plan shows an Index Only Scan with this index:

      ->  Index Only Scan using ord_cust_new on orders ord
                 Index Cond: ((customer_id = cus.customer_id) AND 
((date_part('dow'::text, (order_date)::timestamp without time zone)) = 
'0'::double precision))

Without the standalone order_date added, an Index Scan would have been executed even if not required to get the result.

But what about primary indexes?

I explicitly mentioned secondary indexes in the title, but what about the primary key? The good thing is that, in YugabyteDB, you don’t care. An “Index Scan” on the primary key finds all columns without any additional work because the whole table is stored in the LSM-tree structure. Do not expect an “Index Only Scan” from the primary index, because it is already optimized by the DocDB storage. This is different for PostgreSQL, or Oracle heap tables, where all indexes are secondary indexes and always need an extra hop to the table if they are not covering all query columns. YugabyteDB is primarily optimized for OLTP with primary key access. The secondary index optimization discussed here allows the additional analytic queries — which are part of many OLTP applications — for running efficiently within the same OLTP database.

Further reading

28