"Rows Removed by Index Recheck" on YugabyteDB Index Scan

TL;DR: the problem is fixed, see below, but this post can still help to understand how YugabyteDB works, and how to troubleshoot those kind of issues

This quick post in about:

Index Scan using ...
   Index Cond: (d = ANY ('{...}'::integer[]))
   Rows Removed by Index Recheck: ...

The Rows Removed by Index Recheck means that the Index Cond returned more rows than what the exact predicate should return. Think of it like a fast pre-filter which may return false positive. The Index Scan has removed 42 rows before returning its rows=3 rows result. This means that the index access method (the query to DocDB here) returned 45 rows. But why? In PostgreSQL you see "Rows Removed by Index Recheck" with "Bitmap Index Scan" lossy blocks for small workarea memory. YugabyteDB uses the PostgreSQL query layer, and then the same message, but different indexes (sharding and LSM Tree), so this message has a different cause.

I create a small table with 4 columns having identical values:

create table franck (a int primary key, b int, c int, d int);
CREATE TABLE

insert into franck
 select n, n, n, n from generate_series(1,1000) n;
INSERT 0 1000

I create 3 indexes because I want to show the behavior with HASH and RANGE sharding, and also show which one is the default when not mentioned:

create index franck_default on franck(b);
CREATE INDEX
create index franck_hash    on franck(c hash);
CREATE INDEX
create index franck_range   on franck(d asc);
CREATE INDEX

The default is HASH for the first column:

Table "public.franck"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           | not null |
 b      | integer |           |          |
 c      | integer |           |          |
 d      | integer |           |          |
Indexes:
    "franck_pkey" PRIMARY KEY, lsm (a HASH)
    "franck_default" lsm (b HASH)
    "franck_hash" lsm (c HASH)
    "franck_range" lsm (d ASC)

Now, a simple IN() query:

explain analyze select * from franck where b in (48,64,92);

                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Index Scan using franck_default on franck  (cost=0.00..5.24 rows=10 width=16) (actual time=7.423..9.370 rows=3 loops=1)
   Index Cond: (b = ANY ('{48,64,92}'::integer[]))

This is the most efficient access: get the rows for each value in the IN() clause. This index is HASH sharded, the default for the first column, and the index with an explicit HASH shows the same without index recheck:

explain analyze select * from franck where c in (48,64,92);

                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using franck_hash on franck  (cost=0.00..5.24 rows=10 width=16) (actual time=51.567..52.354 rows=3 loops=1)
   Index Cond: (c = ANY ('{48,64,92}'::integer[]))

Now, here is the same query but with an ascending RANGE sharding:

explain analyze select * from franck where d in (48,64,92);

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Index Scan using franck_range on franck  (cost=0.00..5.24 rows=10 width=16) (actual time=4.361..4.388 rows=3 loops=1)
   Index Cond: (d = ANY ('{48,64,92}'::integer[]))
   Rows Removed by Index Recheck: 42

This is where 45 rows were returned from the storage layer (DocDB) to the query layer (YSQL) in a lossy filtering. And 42 rows out of them are filtered after the access to finally return 3 rows.

Why 45 rows? Here are the steps I'll follow in order to understand (I'm writing this post to show how to troubleshoot this kind of issue):

  • get the postgresql logfile name
  • get the table and index oid in hexadecimal
  • set the tracing of calls from PostgreSQL to DocDB
  • run my query where I see lossy Index Scan
  • grep the trace for these oid in hexadecimal
select pg_current_logfile();

                            pg_current_logfile
-------------------------------------------------------------------------------
 /home/opc/var/data/yb-data/tserver/logs/postgresql-2021-12-06_142453.log

yugabyte=# select lpad(to_hex(oid::int),8,'0')
 from pg_class where relname in ('franck','franck_range');

   lpad
---------------
 000040a9
 000040b0

set yb_debug_log_docdb_requests=on;
SET

select * from franck where d in (48,64,92);

 a  | b  | c  | d
---------+----+----+----
 48 | 48 | 48 | 48
 64 | 64 | 64 | 64
 92 | 92 | 92 | 92

\! awk '/000040a9|000040b0/gsub(/ batch_arguments/,"\n&")' /home/opc/var/data/yb-data/tserver/logs/postgresql-2021-12-06_142453.log

2021-12-06 20:22:00.379 UTC [17112] LOG:  statement: select * from franck where d in (48,64,92);

I1206 20:22:00.379900 17112 pg_session.cc:433]
Applying operation: PGSQL_READ 
client: YQL_CLIENT_PGSQL
stmt_id: 43717504 schema_version: 0 
targets { column_id: 1 } 
column_refs { ids: 1 } 
is_forward_scan: true 
is_aggregate: false 
limit: 1024 
return_paging_state: true 
table_id: "000033c00000300080000000000040b0" 
condition_expr { condition { op: QL_OP_AND operands
 { condition { op: QL_OP_IN operands { column_id: 0 }
  operands { value {
   list_value {
    elems { int32_value: 48 }
    elems { int32_value: 64 }
    elems { int32_value: 92 }
   }
  } }
 } }
} }

I1206 20:22:00.381086 17112 pg_session.cc:433] 
Applying operation: PGSQL_READ 
client: YQL_CLIENT_PGSQL 
stmt_id: 43716928 schema_version: 6 
targets { column_id: 0 } targ
ets { column_id: 1 } targets { column_id: 2 } targets { column_id: 3 } targets { column_id: -8 } column_refs { ids: 0 ids: 1 ids: 2 ids: 3 } 
is_forward_scan: true is_aggregate: false limit: 1024 return_paging_state: true ysql_catalog_version: 67 table_id: "000033c00000300080000000000040a9" 
ybctid_column_value { value { binary_value: "GYUH\200\000\0000!!" } }
 batch_arguments { order: 0 ybctid { value { binary_value: "GYUH\200\000\0000!!" } } }
 batch_arguments { order: 1 ybctid { value { binary_value: "G\032\247H\200\000\0001!!" } } }
 batch_arguments { order: 2 ybctid { value { binary_value: "GT\013H\200\000\0002!!" } } }
 batch_arguments { order: 3 ybctid { value { binary_value: "GQ\"H\200\000\0003!!" } } }
 batch_arguments { order: 4 ybctid { value { binary_value: "G4nH\200\000\0004!!" } } }
 batch_arguments { order: 5 ybctid { value { binary_value: "G\215\357H\200\000\0005!!" } } }
 batch_arguments { order: 6 ybctid { value { binary_value: "G\221`H\200\000\0006!!" } } }
 batch_arguments { order: 7 ybctid { value { binary_value: "G5\032H\200\000\0007!!" } } }
 batch_arguments { order: 8 ybctid { value { binary_value: "G\317\321H\200\000\0008!!" } } }
 batch_arguments { order: 9 ybctid { value { binary_value: "G\246\350H\200\000\0009!!" } } }
 batch_arguments { order: 10 ybctid { value { binary_value: "G\207\216H\200\000\000:!!" } } }
 batch_arguments { order: 11 ybctid { value { binary_value: "G>\232H\200\000\000;!!" } } }
 batch_arguments { order: 12 ybctid { value { binary_value: "GkOH\200\000\000<!!" } } }
 batch_arguments { order: 13 ybctid { value { binary_value: "G\006\235H\200\000\000=!!" } } }
 batch_arguments { order: 14 ybctid { value { binary_value: "G\341\014H\200\000\000>!!" } } }
 batch_arguments { order: 15 ybctid { value { binary_value: "G\266\244H\200\000\000?!!" } } }
 batch_arguments { order: 16 ybctid { value { binary_value: "G pH\200\000\000@!!" } } }
 batch_arguments { order: 17 ybctid { value { binary_value: "Gg\230H\200\000\000A!!" } } }
 batch_arguments { order: 18 ybctid { value { binary_value: "G\241[H\200\000\000B!!" } } }
 batch_arguments { order: 19 ybctid { value { binary_value: "G\270JH\200\000\000C!!" } } }
 batch_arguments { order: 20 ybctid { value { binary_value: "G\246mH\200\000\000D!!" } } }
 batch_arguments { order: 21 ybctid { value { binary_value: "G\234vH\200\000\000E!!" } } }
 batch_arguments { order: 22 ybctid { value { binary_value: "G\210\232H\200\000\000F!!" } } }
 batch_arguments { order: 23 ybctid { value { binary_value: "Gj\375H\200\000\000G!!" } } }
 batch_arguments { order: 24 ybctid { value { binary_value: "Gv\220H\200\000\000H!!" } } }
 batch_arguments { order: 25 ybctid { value { binary_value: "GB3H\200\000\000I!!" } } }
 batch_arguments { order: 26 ybctid { value { binary_value: "G\344\332H\200\000\000J!!" } } }
 batch_arguments { order: 27 ybctid { value { binary_value: "G\262TH\200\000\000K!!" } } }
 batch_arguments { order: 28 ybctid { value { binary_value: "G\361\302H\200\000\000L!!" } } }
 batch_arguments { order: 29 ybctid { value { binary_value: "G`\317H\200\000\000M!!" } } }
 batch_arguments { order: 30 ybctid { value { binary_value: "G  H\200\000\000N!!" } } }
 batch_arguments { order: 31 ybctid { value { binary_value: "G\277\033H\200\000\000O!!" } } }
 batch_arguments { order: 32 ybctid { value { binary_value: "G\314\247H\200\000\000P!!" } } }
 batch_arguments { order: 33 ybctid { value { binary_value: "G\000\206H\200\000\000Q!!" } } }
 batch_arguments { order: 34 ybctid { value { binary_value: "G\334\205H\200\000\000R!!" } } }
 batch_arguments { order: 35 ybctid { value { binary_value: "Gj\013H\200\000\000S!!" } } }
 batch_arguments { order: 36 ybctid { value { binary_value: "G\263\002H\200\000\000T!!" } } }
 batch_arguments { order: 37 ybctid { value { binary_value: "G\364\356H\200\000\000U!!" } } }
 batch_arguments { order: 38 ybctid { value { binary_value: "G9bH\200\000\000V!!" } } }
 batch_arguments { order: 39 ybctid { value { binary_value: "G\235\372H\200\000\000W!!" } } }
 batch_arguments { order: 40 ybctid { value { binary_value: "G\252[H\200\000\000X!!" } } }
 batch_arguments { order: 41 ybctid { value { binary_value: "G-\376H\200\000\000Y!!" } } }
 batch_arguments { order: 42 ybctid { value { binary_value: "G=GH\200\000\000Z!!" } } }
 batch_arguments { order: 43 ybctid { value { binary_value: "G9\342H\200\000\000[!!" } } }
 batch_arguments { order: 44 ybctid { value { binary_value: "GntH\200\000\000\\!!" } } }

What the trace shows is:

  • the predicate is passed as an array with the 3 values: list_value { elems { int32_value: 48 } elems { int32_value: 64 } elems { int32_value: 92 } }
  • the table is read for each value, because, unfortunately, this Index Recheck is processed on the Index Scan result

The 45 rows are actually all the rows from the range between the minimum and maximum value of the IN() list:

select count(*) from franck where c between 48 and 92;
 count
------------
    45

This issue is observed on:

select version();
                                                   version
------------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.2-YB-2.11.0.1-b0 on x86_64-pc-linux-gnu, compiled by gcc (Homebrew gcc 5.5.0_4) 5.5.0, 64-bit

and tracked by #8023. If this is a problem for you, please update the git issue. However, a column that is queried with a list of discrete value should probably be indexed by HASH. If you need a RANGE index, and the unnecessary hops to the table are problematic, you can use a covering index. See: How a Distributed SQL Database Boosts Secondary Index Queries with Index Only Scan

Issue fixed with Hybrid Scan

I've written this blog post when helping a user who encountered this issue. I mentioned that many optimizations are ongoing. And this one was solved faster than I thought. Our development is working on Hybrid Scan, to optimize queries doing point access (like this IN list) and range scans, and provided a patch for our user. The occasion to run the same (this is not public release yet, I'm using a patched version, but that will come soon):

yb_demo_northwind=> \c yugabyte yugabyte
psql (15devel, server 11.2-YB-2.11.2.0-b0)
You are now connected to database "yugabyte" as user "yugabyte".
yugabyte=# create table franck (a int primary key, b int, c int, d int);
CREATE TABLE
yugabyte=# insert into franck
yugabyte-#  select n, n, n, n from generate_series(1,1000) n;
INSERT 0 1000
yugabyte=# create index franck_default on franck(b);
CREATE INDEX
yugabyte=# create index franck_hash    on franck(c hash);
CREATE INDEX
yugabyte=# create index franck_range   on franck(d asc);
CREATE INDEX
yugabyte=# explain analyze select * from franck where b in (48,64,92);
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using franck_default on franck  (cost=0.00..5.24 rows=10 width=16) (actual time=30.754..32.133 rows=3 loops=1)
   Index Cond: (b = ANY ('{48,64,92}'::integer[]))
 Planning Time: 16.388 ms
 Execution Time: 32.218 ms
(4 rows)

yugabyte=# explain analyze select * from franck where c in (48,64,92);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Index Scan using franck_hash on franck  (cost=0.00..5.24 rows=10 width=16) (actual time=1.178..1.905 rows=3 loops=1)
   Index Cond: (c = ANY ('{48,64,92}'::integer[]))
 Planning Time: 1.100 ms
 Execution Time: 1.943 ms
(4 rows)

yugabyte=# explain analyze select * from franck where d in (48,64,92);
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Index Scan using franck_range on franck  (cost=0.00..5.24 rows=10 width=16) (actual time=4.883..4.889 rows=3 loops=1)
   Index Cond: (d = ANY ('{48,64,92}'::integer[]))
 Planning Time: 1.164 ms
 Execution Time: 4.989 ms
(4 rows)

The last one is the one where I had this Rows Removed by Index Recheck.

In order to show the difference, I'll gather the number of rocksdb seeks (expect a blog post about this soon).

When running the first test which did a range scan from the minimum to maximum of the list, and then removing rows after the operation, I had 1 seek to the index and 45 seeks to the tables:

[opc@C demo]$ metrics_loop yb1.pachot.net 1 'seek$'

 1 yb2.pachot.net:yugabyte.franck_range/1b56d7ce52584fbfbb61f7d93bc4cb68:rocksdb_number_db_seek
45 yb2.pachot.net:yugabyte.franck/16c4aa0d9d5142e8936b824b4b52da8c:rocksdb_number_db_seek

The 45 are the access to the table that I had traced above.

Now, after patching my database, here is what I see:

[opc@C demo]$ metrics_loop yb1.pachot.net 1 'seek$'

 3 yb1.pachot.net:yugabyte.franck/1178d0549747481ebc8f8b0e7753d6b5:rocksdb_number_db_seek
 3 yb1.pachot.net:yugabyte.franck_range/3c4728a430644aa6a0d0bdc4a46b288b:rocksdb_number_db_seek

Now the IN() with 3 values is retreived with 3 seeks, and then 3 to the table. At the time of writing this, this new feature is not yet in the latest public release, but expect it soon.

20