⏱ Read from the nearest peer in a multi-region database 🚀 with follower reads

In the previous post I set up a YugabyteDB universe over 3 regions: eu-west-1, us-east-2, and ap-south-1. To keep things simple, I put only one node in each region. The goal was to show the latency when a distributed query reads from multiple tablets.

The challenge in this case is to ensure high availability, with synchronous replication across the cluster, without penalizing performance because of the inherent cross-region latency:

  • eu-west-1 to us-east-2 is about 100 milliseconds
  • eu-west-1 to ap-south-1 is about 120 milliseconds
  • us-east-2 to ap-south-1 is about 220 milliseconds

With leaders in EU and connecting from EU, reading 100 rows takes less than 1 millisecond from a single tablet and 4 milliseconds when hash-partitioned to many. Writing has to wait for the quorum: 120 milliseconds with one tablet and 500 ms when the table is distributed over 3 tablets.

Still with leaders in EU, but connected to US, reading from one tablet was 80 milliseconds, and 500 ms from the 3 tablets. The writes were 300 milliseconds for a single tablet and 1200 ms for multiple tablets. The queries are in the previous post. I'll run the same here.
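
As a quick reminder of the setup (a sketch only; the exact DDL is in the previous post), franck_range is range-sharded on its primary key and franck_hash is hash-sharded, both holding the same rows:

-- hypothetical reconstruction of the two test tables (the real DDL is in the previous post)
create table franck_range ( k int, v int, primary key (k asc) );   -- range sharding, one tablet
create table franck_hash  ( k int, v int, primary key (k hash) )   -- hash sharding on k
 split into 3 tablets;                                              -- distributed over 3 tablets
-- both tables are loaded with the same rows (about a thousand), e.g. with generate_series()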

Read from remote region

This is the same as in the previous post, connected to US, but I'm doing it again in case the speed of light changed during the night 😂. Note that I run each query several times to be sure there's no cache effect (the first parse after a connection has to get the dictionary metadata from the yb-master, whose leader is in the preferred region). I want to measure the network latency only.

Well, one thing is different: I did the previous test with version 2.9 but, as I want to show a new feature, I've upgraded the cluster to 2.11 (with YugabyteDB, upgrading is easy: a rolling restart of the nodes with the new version).

\c postgres://[email protected]:5433/yugabyte

psql (15devel, server 11.2-YB-2.11.2.0-b0)

yugabyte=# -- read from one tablet:
yugabyte=# explain analyze select count(*),min(k),max(k) from franck_range where k between 100 and 100+99;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.12..4.13 rows=1 width=16) (actual time=78.594..78.595 rows=1 loops=1)
   ->  Index Scan using franck_range_pkey on franck_range  (cost=0.00..4.12 rows=1 width=4) (actual time=78.539..78.576 rows=100 loops=1)
         Index Cond: ((k >= 100) AND (k <= 199))
 Planning Time: 0.068 ms
 Execution Time: 78.667 ms
(5 rows)

yugabyte=# -- write in one tablet:
yugabyte=# explain analyze update franck_range set v=v+1 where k between 100 and 100+99;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Update on franck_range  (cost=0.00..4.12 rows=1 width=40) (actual time=78.939..78.939 rows=0 loops=1)
   ->  Index Scan using franck_range_pkey on franck_range  (cost=0.00..4.12 rows=1 width=40) (actual time=78.505..78.560 rows=100 loops=1)
         Index Cond: ((k >= 100) AND (k <= 199))
 Planning Time: 0.064 ms
 Execution Time: 317.764 ms
(5 rows)

yugabyte=# -- read from all tablets:
yugabyte=# explain analyze select count(*),min(k),max(k) from franck_hash where k between 100 and 100+99;

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=112.50..112.51 rows=1 width=16) (actual time=469.470..469.470 rows=1 loops=1)
   ->  Seq Scan on franck_hash  (cost=0.00..105.00 rows=1000 width=4) (actual time=78.208..469.447 rows=100 loops=1)
         Filter: ((k >= 100) AND (k <= 199))
         Rows Removed by Filter: 800
 Planning Time: 0.068 ms
 Execution Time: 469.538 ms
(6 rows)

yugabyte=# -- write in all tablets:
yugabyte=# explain analyze update franck_hash set v=v+1 where k between 100 and 100+99;

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Update on franck_hash  (cost=0.00..107.50 rows=1000 width=40) (actual time=1128.763..1128.763 rows=0 loops=1)
   ->  Seq Scan on franck_hash  (cost=0.00..107.50 rows=1000 width=40) (actual time=78.205..1128.140 rows=100 loops=1)
         Filter: ((k >= 100) AND (k <= 199))
         Rows Removed by Filter: 800
 Planning Time: 0.062 ms
 Execution Time: 1357.659 ms
(6 rows)

Ok, so same results as yesterday. I like predictable performance.
Note that the query planner chooses an Index Scan for the range-sharded table, and a Seq Scan for the hash-sharded one. This is not surprising because I'm using a range predicate. As I'm comparing times in the following tests, I'll check that I'm getting the same execution plans.

Now connecting to Mumbai, which is a bit farther from the leaders (still in the preferred region, EU):

\c postgres://[email protected]:5433/yugabyte

psql (15devel, server 11.2-YB-2.9.1.0-b0)

yugabyte=# -- read from one tablet:
yugabyte=# explain analyze select count(*),min(k),max(k) from franck_range where k between 100 and 100+99;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.12..4.13 rows=1 width=16) (actual time=119.536..119.537 rows=1 loops=1)
   ->  Index Scan using franck_range_pkey on franck_range  (cost=0.00..4.12 rows=1 width=4) (actual time=119.463..119.504 rows=100 loops=1)
         Index Cond: ((k >= 100) AND (k <= 199))
 Planning Time: 0.068 ms
 Execution Time: 119.598 ms
(5 rows)

yugabyte=# -- write in one tablet:
yugabyte=# explain analyze update franck_range set v=v+1 where k between 100 and 100+99;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Update on franck_range  (cost=0.00..4.12 rows=1 width=40) (actual time=119.933..119.933 rows=0 loops=1)
   ->  Index Scan using franck_range_pkey on franck_range  (cost=0.00..4.12 rows=1 width=40) (actual time=119.471..119.529 rows=100 loops=1)
         Index Cond: ((k >= 100) AND (k <= 199))
 Planning Time: 0.068 ms
 Execution Time: 321.654 ms
(5 rows)

yugabyte=# -- read from all tablets:
yugabyte=# explain analyze select count(*),min(k),max(k) from franck_hash where k between 100 and 100+99;

                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=112.50..112.51 rows=1 width=16) (actual time=715.655..715.655 rows=1 loops=1)
   ->  Seq Scan on franck_hash  (cost=0.00..105.00 rows=1000 width=4) (actual time=119.124..715.617 rows=100 loops=1)
         Filter: ((k >= 100) AND (k <= 199))
         Rows Removed by Filter: 800
 Planning Time: 0.066 ms
 Execution Time: 715.724 ms
(6 rows)

yugabyte=# -- write in all tablets:
yugabyte=# explain analyze update franck_hash set v=v+1 where k between 100 and 100+99;

                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Update on franck_hash  (cost=0.00..107.50 rows=1000 width=40) (actual time=1510.608..1510.608 rows=0 loops=1)
   ->  Seq Scan on franck_hash  (cost=0.00..107.50 rows=1000 width=40) (actual time=119.196..1509.954 rows=100 loops=1)
         Filter: ((k >= 100) AND (k <= 199))
         Rows Removed by Filter: 800
 Planning Time: 0.063 ms
 Execution Time: 1749.954 ms
(6 rows)

Everything takes more time. The worst case, scanning 1000 rows distributed over many tablets, is about 800 milliseconds to read and 1800 ms to write.

What if we still want the leaders located in the preferred region, because most of our activity is there, but want faster response times for sessions connected to the other regions? We cannot reduce the network latency, but we can reduce our requirements. Reading from the leaders is the only way to be sure to get the latest changes (Consistency, the C in the CAP theorem). But do we need that?

Follower reads

I'm not talking about the eventual consistency we see in AP systems where, when a region is down for minutes or hours (network Partition, P), you continue to read stale values without raising an error (full Availability, A), without really knowing how stale your reads are. Usually this eventual consistency is just a few milliseconds when everything works as expected, but it becomes unbounded in case of network failure. With YugabyteDB follower reads, we can accept stale results, but bounded to a defined time window.

yugabyte=# show yb_read_from_followers;
 yb_read_from_followers
-----------------------------
 off

yugabyte=# show yb_follower_read_staleness_ms;
 yb_follower_read_staleness_ms
------------------------------------
 30000

I'm enabling follower reads by setting yb_read_from_followers=on. This currently works only for read-only transactions, which I declare with set default_transaction_read_only = on:

\c postgres://[email protected]:5433/yugabyte

psql (15devel, server 11.2-YB-2.9.1.0-b0)

yugabyte=# set yb_read_from_followers=on;
SET

yugabyte=# set default_transaction_read_only = on;
SET

Same queries from Mumbai, but only the selects of course (the update would fail with ERROR: cannot execute UPDATE in a read-only transaction):

yugabyte=# -- read from one tablet:
yugabyte=# explain analyze select count(*),min(k),max(k) from franck_range where k between 100 and 100+99;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.12..4.13 rows=1 width=16) (actual time=1.421..1.421 rows=1 loops=1)
   ->  Index Scan using franck_range_pkey on franck_range  (cost=0.00..4.12 rows=1 width=4) (actual time=1.359..1.399 rows=100 loops=1)
         Index Cond: ((k >= 100) AND (k <= 199))
 Planning Time: 0.068 ms
 Execution Time: 1.476 ms
(5 rows)

yugabyte=# -- read from all tablets:
yugabyte=# explain analyze select count(*),min(k),max(k) from franck_hash where k between 100 and 100+99;

                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=112.50..112.51 rows=1 width=16) (actual time=5.698..5.698 rows=1 loops=1)
   ->  Seq Scan on franck_hash  (cost=0.00..105.00 rows=1000 width=4) (actual time=1.030..5.680 rows=100 loops=1)
         Filter: ((k >= 100) AND (k <= 199))
         Rows Removed by Filter: 800
 Planning Time: 0.064 ms
 Execution Time: 5.761 ms
(6 rows)
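
And, just to illustrate the read-only restriction, a write attempt in this session fails (the statement below is my own example; the error is the one mentioned above):

yugabyte=# update franck_range set v=v+1 where k between 100 and 100+99;
ERROR:  cannot execute UPDATE in a read-only transaction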

With follower reads, I'm back to the same performance as when I was connected to the region holding all the leaders.

I'll add a timestamp column to the table and run an update every 5 seconds (in a session connected to Ireland):

yugabyte=# alter table franck_hash 
           add column last_change timestamptz;
ALTER TABLE

yugabyte=# update franck_hash set v=v+1, last_change=now() 
           where k between 100 and 100+99;
UPDATE 100
yugabyte=# \watch 5

From the session connected to Mumbai, with yb_read_from_followers=on and default_transaction_read_only=on set, I query every 5 seconds:

yugabyte=# select min(v),max(v),now()-max(last_change) lag from franck_hash where k between 100 and 100+99;

  min  |  max  |      lag
-------+-------+----------------
 34849 | 34849 | 00:00:33.52739
(1 row)

yugabyte=# \watch 5

Tue 07 Dec 2021 11:14:48 AM GMT (every 5s)

  min  |  max  |       lag
-------+-------+-----------------
 34855 | 34855 | 00:00:31.303417
(1 row)

Tue 07 Dec 2021 11:14:53 AM GMT (every 5s)

  min  |  max  |       lag
-------+-------+-----------------
 34856 | 34856 | 00:00:31.304044
(1 row)

Tue 07 Dec 2021 11:14:58 AM GMT (every 5s)

  min  |  max  |       lag
-------+-------+-----------------
 34857 | 34857 | 00:00:31.304038
(1 row)

Tue 07 Dec 2021 11:15:03 AM GMT (every 5s)

  min  |  max  |       lag
-------+-------+-----------------
 34858 | 34858 | 00:00:31.303999
(1 row)

The lag is always more or less 30 seconds (the 30-second staleness plus up to the update interval). Even with follower reads, we are still ACID, seeing a consistent result. And, to keep it simple and predictable, we don't try to read the most recent value possible. What we do is read a snapshot from the past, 30 seconds ago, so that we are sure that all writes up to that point in time have reached all the followers.

But what do 30 seconds mean in a distributed SQL database, on servers spread across the world, with time synchronized by NTP? YugabyteDB runs on many platforms and cannot rely on atomic clocks like Google Spanner. Any timestamp is subject to clock skew. The cluster defines a max_clock_skew between two servers, with a default of 500 milliseconds. This is what limits the minimum staleness:

yugabyte=# set yb_follower_read_staleness_ms=1;
ERROR:  cannot enable yb_read_from_followers with a staleness of less than 2 * (max_clock_skew = 500000 usec)
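
So, with the default max_clock_skew of 500 ms, the minimum staleness is 2 × 500 ms = 1 second. Anything at or above that should be accepted (a quick check, assuming the boundary value is allowed):

yugabyte=# set yb_follower_read_staleness_ms=1000;
SET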

I'll set 2 seconds:

\c postgres://[email protected]:5433/yugabyte

psql (15devel, server 11.2-YB-2.9.1.0-b0)

yugabyte=# set yb_read_from_followers=on;
SET

yugabyte=# set default_transaction_read_only = on;
SET

yugabyte=# set yb_follower_read_staleness_ms=2000;
SET

While the update is running every 0.5 seconds, here are my follower reads:

\timing on

yugabyte=# select min(v),max(v),now()-max(last_change) lag from franck_hash where k between 100 and 100+99;

  min  |  max  |       lag
-------+-------+-----------------
 36033 | 36033 | 00:00:03.055112
(1 row)

Time: 133.398 ms

yugabyte=# select min(v),max(v),now()-max(last_change) lag from franck_hash where k between 100 and 100+99;

  min  |  max  |       lag
-------+-------+-----------------
 36035 | 36035 | 00:00:02.785808
(1 row)

Time: 133.090 ms

yugabyte=# select min(v),max(v),now()-max(last_change) lag from franck_hash where k between 100 and 100+99;

  min  |  max  |       lag
-------+-------+-----------------
 36036 | 36036 | 00:00:02.946859

(1 row)

Time: 133.182 ms
yugabyte=# select min(v),max(v),now()-max(last_change) lag from franck_hash where k between 100 and 100+99;

  min  |  max  |       lag
-------+-------+-----------------
 36037 | 36037 | 00:00:03.107404
(1 row)

Time: 132.853 ms

The result is always consistent. The update increases the value for all rows, and all rows have the same value.

The response time that was around 1500 milliseconds when reading from the leaders is now about 130 milliseconds. This is better. But why not down to the 5 milliseconds I had when connected to the region with all the leaders?

Never leave a question unanswered. I was lazy and used \timing instead of explain analyze:

yugabyte=# explain analyze select min(v),max(v),now()-max(last_change) lag from franck_hash where k between 100 and 100+99;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=112.50..112.52 rows=1 width=24) (actual time=6.579..6.579 rows=1 loops=1)
   ->  Seq Scan on franck_hash  (cost=0.00..105.00 rows=1000 width=12) (actual time=1.318..6.559 rows=100 loops=1)
         Filter: ((k >= 100) AND (k <= 199))
         Rows Removed by Filter: 800
 Planning Time: 0.091 ms
 Execution Time: 6.641 ms
(6 rows)

Time: 133.081 ms

The query takes 6 milliseconds, as expected when reading from the local region. The rest of the ~130 ms is the client round trip: I didn't move to Mumbai to do this demo, I'm connected from Switzerland.

Of course, when reading from followers, you should connect from within your own region.

Unbalanced follower regions

In my simple cluster with one node in each region, and 3 nodes for an RF=3 replication factor, all tablet followers are balanced. Let's see what happens when scaling out the US region by adding 5 nodes there.
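
A quick way to check, from any SQL session, which nodes are in the cluster and in which region (assuming yb_servers(), the function exposed for cluster-aware drivers in recent versions):

yugabyte=# select host, cloud, region, zone from yb_servers();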

Now, I'm connecting to Mumbai, with all leaders in Ireland, and more nodes in Ohio:

\c postgres://[email protected]:5433/yugabyte

psql (15devel, server 11.2-YB-2.9.1.0-b0)

yugabyte=# set yb_read_from_followers=on;
SET

yugabyte=# set default_transaction_read_only = on;
SET

yugabyte=# set yb_follower_read_staleness_ms=2000;
SET

yugabyte=# -- read from one tablet:
yugabyte=# explain analyze select count(*),min(k),max(k) from franck_range where k between 100 and 100+99;
                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.12..4.13 rows=1 width=16) (actual time=1.380..1.381 rows=1 loops=1)
   ->  Index Scan using franck_range_pkey on franck_range  (cost=0.00..4.12 rows=1 width=4) (actual time=1.310..1.351 rows=100 loops=1)
         Index Cond: ((k >= 100) AND (k <= 199))
 Planning Time: 0.064 ms
 Execution Time: 1.432 ms

yugabyte=# -- read from all tablets:
yugabyte=# explain analyze select count(*),min(k),max(k) from franck_hash where k between 100 and 100+99;
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=112.50..112.51 rows=1 width=16) (actual time=6.075..6.076 rows=1 loops=1)
   ->  Seq Scan on franck_hash  (cost=0.00..105.00 rows=1000 width=4) (actual time=1.196..6.059 rows=100 loops=1)
         Filter: ((k >= 100) AND (k <= 199))
         Rows Removed by Filter: 800
 Planning Time: 0.082 ms
 Execution Time: 6.148 ms

With the same latency of about 6 milliseconds, obviously the followers it read from were all still in the local region, even with the extra nodes in Ohio. This is because the tablet peer placement ensures that replicas are spread across all regions; see the min_num_replicas=1 per region in the replication placement info:
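
(a sketch of the relevant part of the cluster config as exposed by the yb-master; field names from memory and zone names assumed, your values may differ)

replication_info {
  live_replicas {
    num_replicas: 3
    placement_blocks {
      cloud_info { placement_cloud: "aws" placement_region: "eu-west-1" placement_zone: "eu-west-1a" }
      min_num_replicas: 1
    }
    placement_blocks {
      cloud_info { placement_cloud: "aws" placement_region: "us-east-2" placement_zone: "us-east-2a" }
      min_num_replicas: 1
    }
    placement_blocks {
      cloud_info { placement_cloud: "aws" placement_region: "ap-south-1" placement_zone: "ap-south-1a" }
      min_num_replicas: 1
    }
  }
}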

Even with many nodes in Ohio, we want followers in multiple regions to be resilient to a region outage. You may think that a full region outage is rare. While I'm writing this, us-east-1 has been down for an hour, so, yes, this definitely happens.

Follower reads on the leader region

Last test: if I connect to the region where I have all the leaders, Ireland, the reads are obviously done locally, from the leaders:

\c postgres://[email protected]:5433/yugabyte
psql (15devel, server 11.2-YB-2.11.2.0-b0)

set yb_read_from_followers=on;
set default_transaction_read_only = on;
set yb_follower_read_staleness_ms=30000;

yugabyte=# select min(v),max(v),now()-max(last_change) lag from franck_hash where k between 100 and 100+99;
  min  |  max  |       lag
------------+-------+-----------------
 36912 | 36912 | 00:00:31.155144
(1 row)

yugabyte=# explain analyze select min(v),max(v),now()-max(last_change) lag from franck_hash where k between 100 and 100+99;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=112.50..112.52 rows=1 width=24) (actual time=6.715..6.716 rows=1 loops=1)
   ->  Seq Scan on franck_hash  (cost=0.00..105.00 rows=1000 width=12) (actual time=1.386..6.680 rows=100 loops=1)
         Filter: ((k >= 100) AND (k <= 199))
         Rows Removed by Filter: 800
 Planning Time: 0.121 ms
 Execution Time: 6.793 ms
(6 rows)

So, finally, what yb_read_from_followers=on enables is a read-only transaction snapshot taken yb_follower_read_staleness_ms milliseconds before the current time, reading from the closest tablet peer, which can be the leader or a follower.
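
To sum up, the session settings used in this post for follower reads (adjust the staleness to the time window you can accept):

set yb_read_from_followers = on;            -- read from the closest tablet peer, leader or follower
set default_transaction_read_only = on;     -- follower reads currently apply to read-only transactions
set yb_follower_read_staleness_ms = 30000;  -- consistent snapshot from 30 seconds in the past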
