πŸ˜πŸš€ Triggers & Stored Procedures for pure data integrity logic and performance

This post shows an hybrid approach between the "optimize for one use case" idea of document databases (where a single table holds all information) and "one database for many use-cases" (where relational data modeling allows multiple access patterns). I'm basing this post on a question by @Manish in our Slack forum, and will show a demo on YugabyteDB, but all code is valid with PostgreSQL.

The idea is to store some post content, by user id, and with a list of tags and groups (friend circle). From a user point of view, where the critical use cases are: inserting a new post, and retrieving posts by user, here is the table that fits the main case:

CREATE TABLE posts_by_user(
    user_id     bigint,
    post_id     bigint generated always as identity,
    group_ids   bigint[] null,
    tag_ids     bigint[] null,
    content     text     null,
    created_date timestamptz,
    PRIMARY KEY (user_id, created_date, post_id),
    UNIQUE      (user_id, post_id)
);

This follows a single-table data model, each row being a document with its content and lists of group_ids and tag_ids. In PostgreSQL, this creates a heap table and a secondary index for the primary key. In YugabyteDB, a table is stored clustered on the primary key, to allow fast point and range access without a secondary index. I didn't mention the sharding method in the CREATE statement, to keep the code compatible with PostgreSQL. YugabyteDB default is hash on the first column, and range on next. So this is equivalent to PRIMARY KEY (user_id HASH, created_date ASC, post_id ASC).

The business key is user_id, post_id and I enforce it with a UNIQUE constraint. But for the primary key, I'm adding the date. The drawback of adding created_date in primary key is full rewrite of the whole document in case the created_date is updated, which is probably not the case here. The advantage is to allow fast access to a time range when looking at one user posts. This is something to decide when knowing all access patterns. For YugabyteDB, it would be better to set the sharding options explicitly, with a descending order: PRIMARY KEY (user_id HASH, created_date DESC post_id ASC).

GIN indexes

Other use cases involve getting posts by tag_idor by group_id to fill the feed list. With this single-table design, I can create GIN indexes on those arrays:

create index posts_by_user_group_ids on posts_by_user using gin (group_ids);
create index posts_by_user_tag_ids   on posts_by_user using gin (tag_ids);

GIN indexes are supported in YugabyteDB as of version 2.11 and the feature roadmap is tracked in #7850

However, I cannot add the created_date in them. This is a limitation from PostgreSQL. Trying to add it in the INCLUDING clause will raise ERROR: access method "gin" does not support included columns and trying to add it in the indexed columns will raise ERROR: data type timestamp with time zone has no default operator class for access method "gin".

Because of this limitation, I'll not create those GIN indexes here for the queries that are based on range of created_date. However, there is still a place where GIN index may help, for text search on the post content. I described this in a previous post.

Normalization

So, what is the solution in a SQL database? Here is the reason why relational database were invented: allow efficient query for multiple access paths without compromising data integrity. Rather than storing the post tags and groups as an array within the posts, I can store each in their own table:

CREATE TABLE posts_by_tag(
    tag_id     bigint not null,
    user_id    bigint not null,
    post_id    bigint not null,
    created_date timestamptz not null,
    PRIMARY KEY (tag_id, created_date, user_id, post_id),
    UNIQUE      (tag_id, user_id, post_id),
    FOREIGN key (user_id, created_date, post_id) references posts_by_user (user_id, created_date, post_id) on delete cascade
);

CREATE TABLE posts_by_group(
    group_id   bigint not null,
    user_id    bigint not null,
    post_id    bigint not null,
    created_date timestamptz not null,
    PRIMARY KEY (group_id,created_date, user_id,post_id),
    UNIQUE      (group_id, user_id, post_id),
    FOREIGN key (user_id, created_date, post_id) references posts_by_user (user_id, created_date, post_id) on delete cascade
);

I've declared the foreign key to guarantee the consistency of data, and it contains the created_date. This is on purpose to be able to filter on a time range from this table, before doing to read the main table to get the content. This is why I didn't create the GIN indexes which didn't allow it. I've also declared the unicity of columns without the created_date to enforce data integrity. My goal here is to consider these posts_by_group and posts_by_tag tables like a secondary index. Exactly like the GIN index I wanted to build, but with additional columns to match the selectivity of the uses cases. I'll not update those additional tables directly but they will be automatically maintained when updating the main posts_by_user table. And this is where I need a trigger.

Trigger and Atomic Procedure

Ideally, the application code focuses on business logic. Data integrity should be fully implemented in the database. The best is with declarative constraints like the FOREIGN KEY I defined above. Or, when declarative is not possible, with triggers and procedures. And this is what I'm showing here.

The application will simply insert (or update or delete) on the main table posts_by_user.

I'm keeping the tag_ids and group_ids arrays there to get everything in one document when accessing to a post by its primary key. This also eases the logic to maintain the secondary tables posts_by_tag and posts_by_group because any DML on the main table will know the old and new value. Thanks to this, there is no need for other indexes on those secondary tables. The logic is simple, and can be optimized if needed, here I delete the entries for the old values and insert those for the new ones. This is exactly how a secondary index works, but here I'm coding it in a procedure [Please read the comments, there's a much better version thanks to Pavel]:

CREATE OR REPLACE FUNCTION posts_by_user_dml()
RETURNS TRIGGER AS
$$
declare
 loop_tag_id bigint;
 loop_group_id bigint;
begin
 if old.tag_ids is not null then 
     -- delete entries for old values
     foreach loop_tag_id in array old.tag_ids loop
      delete from posts_by_tag t
       where t.tag_id = loop_tag_id
         and t.user_id= old.user_id
         and t.post_id= old.post_id;
     end loop;
  end if;
 if new.tag_ids is not null then   
     -- insert entries for new values
     foreach loop_tag_id in array new.tag_ids loop
      insert into posts_by_tag(tag_id,user_id, post_id,created_date)
       values (loop_tag_id,new.user_id, new.post_id, new.created_date);
     end loop;
 end if;
 if old.group_ids is not null then 
     -- delete entries for old values
     foreach loop_group_id in array old.group_ids loop
      delete from posts_by_group t 
       where t.group_id = loop_group_id
         and t.user_id= old.user_id
         and t.post_id= old.post_id;
     end loop;
  end if;
  if new.group_ids is not null then 
      -- insert entries for new values
     foreach loop_group_id in array new.group_ids loop
      insert into posts_by_group(group_id,user_id, post_id,created_date)
       values (loop_group_id,new.user_id, new.post_id, new.created_date);
     end loop;
  end if;
 return new;
end;
$$
LANGUAGE plpgsql;

This procedure operates with the old and new records from a trigger. Here is the declaration of the trigger:

CREATE TRIGGER posts_by_user_dml
AFTER insert or update or delete ON posts_by_user
FOR EACH ROW
EXECUTE PROCEDURE posts_by_user_dml();

The ON DELETE is not necessary because this one is automated by the ON DELETE CASCADE foreign key constraint.

I usually prefer calling a procedure rather than a DML statement with some triggers behind it. This to avoid side effects hidden behind a SQL statement. But here a trigger is fine because the INSERT statement keeps its semantic. What happens here is just technical to maintain the secondary tables.

Unit Tests

Deploying code in the database does not exempt you from unit testing it. This is easy. There are 3 DML operations (DELETE, INSERT, UPDATE) and values can be null, empty array or array of integers.

Here is an example:

delete from posts_by_user;
insert into posts_by_user (user_id, group_ids, tag_ids, content, created_date)
values  (1,array[1,2,3],array[1,2,3],'x',date'2021-01-01');
insert into posts_by_user (user_id, group_ids, tag_ids, content, created_date)
values  (2,array[1,2,3],array[]::bigint[],'x',date'2021-01-01');
update posts_by_user set tag_ids=tag_ids||'{4}' where user_id=1;

And the validation of it, comparing the stored arrays with the ones built from the secondary tables:

with join_secondary as (
 select *,
  array(
  SELECT tag_id
  from posts_by_tag t
  where t.user_id=p.user_id
    and t.post_id=p.post_id
    and t.created_date=p.created_date
  ) tag_ids_secondary,
  array(
  SELECT group_id
  from posts_by_group g
  where g.user_id=p.user_id
    and g.post_id=p.post_id
    and g.created_date=p.created_date
  ) group_ids_secondary
 from posts_by_user p)
 select tag_ids<@tag_ids_secondary and group_ids<@group_ids_secondary
   and  tag_ids@>tag_ids_secondary and group_ids@>group_ids_secondary
   "valid?",* from join_secondary;

This shows:

yugabyte-#    "valid?",* from join_secondary;
 valid? | user_id | post_id | group_ids |  tag_ids  | content |         created_date          | tag_ids_secondary | group_ids_secondary
--------+---------+---------+-----------+-----------+---------+-------------------------------+-------------------+---------------------
 t      |       2 |       2 | {1,2,3}   | {}        | x       | 2021-12-20 10:18:56.922046+00 | {}                | {3,2,1}
 t      |       1 |       1 | {1,2,3}   | {1,2,3,4} | x       | 2021-01-01 00:00:00+00        | {3,4,2,1}         | {3,2,1}
(2 rows)

INSERT performance

Of course, maintaining the secondary tables has a cost on inserts, like any additional index. It is important to understand how it scales on a distributed database. Here is an example inserting 100 posts per users, for 10 users, with 20 tags and groups (chosen at random among 1000 ones):

truncate posts_by_user cascade;

with
users as (select generate_series(1,10) user_id),
posts as (select generate_series(1,100) post_id),
ids  as (select distinct (1000*random()*generate_series(1,20))::int id)
insert into posts_by_user (user_id,group_ids,tag_ids,content,created_date)
select user_id
 ,ARRAY(select id from ids) group_ids
 ,ARRAY(select id from ids) tag_ids
 ,'Lorem ipsum...' as content
 , now() - random() * interval '10 year'
from users,posts;
\timing on
\watch 0.1

I repeat this with \watch and in a small server this is about 8 seconds for 1000 posts, which means about 8 milliseconds per post:

Time: 10604.948 ms (00:10.605)
Mon 20 Dec 2021 11:29:17 AM GMT (every 0.1s)

INSERT 0 1000

Time: 8533.703 ms (00:08.534)
Mon 20 Dec 2021 11:29:25 AM GMT (every 0.1s)

INSERT 0 1000

Time: 8320.473 ms (00:08.320)
Mon 20 Dec 2021 11:29:34 AM GMT (every 0.1s)

INSERT 0 1000

Note that in my lab, I still have the GIN indexes created. This doesn't change the latency because all is distributed across many nodes that are not saturated. Same with the indexes that enforce the unique constraints. I've run this from 6 sessions connected to the 6 nodes of a RF=3 Multi-AZ YugabyteDB cluster (which means High Availability with 3-way replication to be resilient to one availability zone failure). The nodes are AWS c5.2xlarge. I've run 6 threads connected and loaded 1000 new posts per second. This is displayed as 3000 Ops/Sec because each insert is actually 3 inserts with my trigger maintaining the secondary tables:
1000 posts per second

Now the question is about how it scales. Assessing the scalability means understanding the time complexity.

I've traced the calls between the YSQL layer (the PostgreSQL code) and the DocDB layer (the YugabyteDB distributed storage and transaction):

yugabyte=# select to_hex(oid::int) as "0xOID", relname,relkind, reltuples from (select oid,relname, relkind,relnamespace,reltuples from pg_class) c natural join (select oid relnamespace from pg_namespace where nspname='public') n where relname like 'posts%' order by 1;

 0xOID |                   relname                   | relkind |  reltuples
-------+---------------------------------------------+---------+-------------
 433f  | posts_by_user_post_id_seq                   | S       |           1
 4341  | posts_by_user                               | r       |   4.644e+07
 4344  | posts_by_user_pkey                          | i       |   4.644e+07
 4346  | posts_by_tag                                | r       | 4.39074e+08
 4349  | posts_by_tag_pkey                           | i       | 4.39074e+08
 4116  | posts_by_tag_tag_id_user_id_post_id_key     | i       | 4.39074e+08
 4352  | posts_by_group                              | r       | 5.40224e+08
 4355  | posts_by_group_pkey                         | i       | 5.40224e+08
 4357  | posts_by_group_group_id_user_id_post_id_key | i       | 5.40224e+08
 430a  | posts_by_user_group_ids                     | i       |   4.644e+07
 4400  | posts_by_user_tag_ids                       | i       |   4.644e+07

yugabyte=# set log_statement='all';
SET

yugabyte=# set yb_debug_log_docdb_requests=true;
SET

yugabyte=# select pg_current_logfile();
                            pg_current_logfile
--------------------------------------------------------------------------
 /home/opc/var/data/yb-data/tserver/logs/postgresql-2021-12-20_000000.log
(1 row)

yugabyte=# \! grep --color=auto -E '^|^.*(pg_session.cc|STATEMENT:)|(PGS|Y)QL[_A-Z]+|[0-9a-f]{4}"|Flushing|Applying|Buffering' /home/opc/var/data/yb-data/tserver/logs/postgresql-2021-12-20_000000.log

yugabyte=# 2021-12-20 10:21:09.424 UTC [17389] LOG:  statement: insert into posts_by_user (user_id, group_ids, tag_ids, content, created_date)
        values  (2,array[1,2,3],array[]::bigint[],'x',date'2021-01-01');
I1220 10:21:09.425124 17389 pg_session.cc:437] Applying operation: PGSQL_READ client: YQL_CLIENT_PGSQL stmt_id: 31771648 schema_version: 0 targets { column_id: 1 } targets { column_id: 2 } targets { column_id: 3 } targets { column_id: 4 } targets { column_id: 5 } targets { column_id: 6 } targets { column_id: 7 } targets { column_id: -8 } column_refs { ids: 1 ids: 2 ids: 3 ids: 4 ids: 5 ids: 6 ids: 7 } is_aggregate: false limit: 1024 return_paging_state: true table_id: "000033e1000030008000000000000a30" index_request { targets { column_id: 3 } column_refs { ids: 3 } is_forward_scan: true is_aggregate: false range_column_values { value { uint32_value: 1259 } } range_column_values { value { uint32_value: 17217 } } range_column_values { value { int32_value: 2 } } table_id: "000033e1000030008000000000000a72" }
I1220 10:21:09.426597 17389 pg_session.cc:370] Buffering operation: PGSQL_WRITE client: YQL_CLIENT_PGSQL stmt_id: 30364144 stmt_type: PGSQL_INSERT table_id: "000033e1000030008000000000004341" schema_version: 0 ybctid_column_value { value { binary_value: "G\317\252I\200\000\000\000\000\000\000\002!I\200\000\000\000\000\000\000\003!" } } column_values { column_id: 2 expr { value { binary_value: "\001\000\000\000\000\000\000\000\024\000\000\000\003\000\000\000\001\000\000\000\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\003\000\000\000\000\000\000\000" } } } column_values { column_id: 3 expr { value { binary_value: "\000\000\000\000\000\000\000\000\024\000\000\000" } } } column_values { column_id: 4 expr { value { string_value: "x" } } } column_values { column_id: 5 expr { value { int64_value: 662774400000000 } } } column_refs { } ysql_catalog_version: 303
I1220 10:21:09.426748 17389 pg_session.cc:370] Buffering operation: PGSQL_WRITE client: YQL_CLIENT_PGSQL stmt_id: 30530672 stmt_type: PGSQL_INSERT table_id: "000033e1000030008000000000004352" schema_version: 1 ybctid_column_value { value { binary_value: "G\355\251I\200\000\000\000\000\000\000\001!I\200\002Z\3120\255\240\000I\200\000\000\000\000\000\000\002I\200\000\000\000\000\000\000\003!" } } column_refs { } ysql_catalog_version: 303
I1220 10:21:09.426813 17389 pg_session.cc:370] Buffering operation: PGSQL_WRITE client: YQL_CLIENT_PGSQL stmt_id: 30441232 stmt_type: PGSQL_INSERT table_id: "000033e1000030008000000000004357" schema_version: 0 partition_column_values { value { int64_value: 1 } } range_column_values { value { int64_value: 2 } } range_column_values { value { int64_value: 3 } } range_column_values { value { } } column_values { column_id: 4 expr { value { binary_value: "G\355\251I\200\000\000\000\000\000\000\001!I\200\002Z\3120\255\240\000I\200\000\000\000\000\000\000\002I\200\000\000\000\000\000\000\003!" } } } column_refs { } ysql_catalog_version: 303
I1220 10:21:09.426929 17389 pg_session.cc:370] Buffering operation: PGSQL_WRITE client: YQL_CLIENT_PGSQL stmt_id: 30438432 stmt_type: PGSQL_INSERT table_id: "000033e1000030008000000000004352" schema_version: 1 ybctid_column_value { value { binary_value: "G\317\252I\200\000\000\000\000\000\000\002!I\200\002Z\3120\255\240\000I\200\000\000\000\000\000\000\002I\200\000\000\000\000\000\000\003!" } } column_refs { } ysql_catalog_version: 303
I1220 10:21:09.426992 17389 pg_session.cc:370] Buffering operation: PGSQL_WRITE client: YQL_CLIENT_PGSQL stmt_id: 30436192 stmt_type: PGSQL_INSERT table_id: "000033e1000030008000000000004357" schema_version: 0 partition_column_values { value { int64_value: 2 } } range_column_values { value { int64_value: 2 } } range_column_values { value { int64_value: 3 } } range_column_values { value { } } column_values { column_id: 4 expr { value { binary_value: "G\317\252I\200\000\000\000\000\000\000\002!I\200\002Z\3120\255\240\000I\200\000\000\000\000\000\000\002I\200\000\000\000\000\000\000\003!" } } } column_refs { } ysql_catalog_version: 303
I1220 10:21:09.427089 17389 pg_session.cc:370] Buffering operation: PGSQL_WRITE client: YQL_CLIENT_PGSQL stmt_id: 30294576 stmt_type: PGSQL_INSERT table_id: "000033e1000030008000000000004352" schema_version: 1 ybctid_column_value { value { binary_value: "G\010DI\200\000\000\000\000\000\000\003!I\200\002Z\3120\255\240\000I\200\000\000\000\000\000\000\002I\200\000\000\000\000\000\000\003!" } } column_refs { } ysql_catalog_version: 303
I1220 10:21:09.427141 17389 pg_session.cc:370] Buffering operation: PGSQL_WRITE client: YQL_CLIENT_PGSQL stmt_id: 32457600 stmt_type: PGSQL_INSERT table_id: "000033e1000030008000000000004357" schema_version: 0 partition_column_values { value { int64_value: 3 } } range_column_values { value { int64_value: 2 } } range_column_values { value { int64_value: 3 } } range_column_values { value { } } column_values { column_id: 4 expr { value { binary_value: "G\010DI\200\000\000\000\000\000\000\003!I\200\002Z\3120\255\240\000I\200\000\000\000\000\000\000\002I\200\000\000\000\000\000\000\003!" } } } column_refs { } ysql_catalog_version: 303
I1220 10:21:09.430510 17389 pg_session.cc:949] Flushing buffered operations, using transactional session (num ops: 7)

I'm not going into the details here. The important is to see that all PGSQL_WRITE operations to the main and secondary tables are all buffered together. They are distributed to each node (depending on the hash code from the primary key) but it is not a per-table fan-out. This is what is great with using stored procedures in distributed databases: the whole transaction can be processed with waiting the acknowledgement for individual statements.

Queries

I have 43 million posts loaded (the ANALYZE ran for 4 minutes)

Getting the the last 2-days posts for one user has to read only the main table, thanks to all being stored in it:

yugabyte=# explain analyze
           select posts_by_user.*
            from posts_by_user where user_id=1
            and created_date > now() - 2*interval '1 day'
           order by created_date desc;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan Backward using posts_by_user_pkey on posts_by_user  (cost=0.00..17.00 rows=100 width=120) (actual time=3.589..84.544 rows=2173 loops=1)
   Index Cond: ((user_id = 1) AND (created_date > (now() - '2 days'::interval)))
 Planning Time: 0.081 ms
 Execution Time: 85.111 ms
(4 rows)

This is the fastest execution plan you can have: read a range from one table, in the primary key order, requiring no additional sorting. It could even be a bit better if I had declared the created_date as DESC instead of ASC (I didn't to keep the PostgreSQL compatibility for the blog post). On PostgreSQL the same execution plan will have to read rows from the heap table, probably scattered, as one user doesn't post all at the same time.

So this returned 2173 posts in 85 milliseconds from a table that has 40 million posts. But the size of the table doesn't matter because it is a range scan.

Getting the posts by a list of tag requires a join from the secondary table:

yugabyte=# explain analyze
           select posts_by_user.*
            from posts_by_user
            join posts_by_tag
            using(user_id, created_date, post_id)
            where posts_by_tag.created_date 
                   > now() - interval '1 month'
                  and tag_id =1
            order by created_date desc limit 100
 ;

Even if there is a join, the time complexity of Index Access and Nested Loop time complexity depends on the result rather than the size of the table. This query is still in the 80 milliseconds:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..28.40 rows=100 width=120) (actual time=2.691..70.550 rows=100 loops=1)
   ->  Nested Loop  (cost=0.00..28.40 rows=100 width=120) (actual time=2.689..70.513 rows=100 loops=1)
         ->  Index Scan Backward using posts_by_tag_pkey on posts_by_tag  (cost=0.00..17.00 rows=100 width=24) (actual time=1.830..1.971 rows=100 loops=1)
               Index Cond: ((tag_id = 1) AND (created_date > (now() - '1 mon'::interval)))
         ->  Index Scan using posts_by_user_pkey on posts_by_user  (cost=0.00..0.11 rows=1 width=120) (actual time=0.670..0.670 rows=1 loops=100)
               Index Cond: ((user_id = posts_by_tag.user_id) AND (created_date = posts_by_tag.created_date) AND (post_id = posts_by_tag.post_id))
 Planning Time: 0.269 ms
 Execution Time: 71.185 ms

This execution plan gives a response time that is proportional to the number of posts per tag filtered by the secondary table (here 100 rows from the latest ones with tag 1 in the past month). In 71 milliseconds. I've run the same getting 1000 rows in 634 milliseconds. With Nested Loop, the join is under control as soon as you bound the inner query rows. And this is why I wanted all filtering criteria in the secondary tables.

When tracing the calls to the storage I see two PGSQL_READ operations:

I1220 11:01:53.145462 17389 pg_session.cc:437] Applying operation: PGSQL_READ client: YQL_CLIENT_PGSQL stmt_id: 33622848 schema_version: 1 partition_column_values { value { int64_value: 1 } } targets { column_id: 0 } targets { column_id: 2 } targets { column_id: 3 } targets { column_id: 1 } targets { column_id: -8 } column_refs { ids: 0 ids: 1 ids: 2 ids: 3 } is_forward_scan: true is_aggregate: false limit: 1024 return_paging_state: true ysql_catalog_version: 308 table_id: "000033e1000030008000000000004382" condition_expr { condition { op: QL_OP_AND operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 1 } operands { value { int64_value: 692708513144764 } } } } } }
I1220 11:01:53.145519 17389 pg_session.cc:437] Applying operation: PGSQL_READ client: YQL_CLIENT_PGSQL stmt_id: 33622848 schema_version: 1 partition_column_values { value { int64_value: 2 } } targets { column_id: 0 } targets { column_id: 2 } targets { column_id: 3 } targets { column_id: 1 } targets { column_id: -8 } column_refs { ids: 0 ids: 1 ids: 2 ids: 3 } is_forward_scan: true is_aggregate: false limit: 1024 return_paging_state: true ysql_catalog_version: 308 table_id: "000033e1000030008000000000004382" condition_expr { condition { op: QL_OP_AND operands { condition { op: QL_OP_GREATER_THAN_EQUAL operands { column_id: 1 } operands { value { int64_value: 692708513144764 } } } } } }

Clearly, in a distributed SQL database, this should be used to gather only a few posts (as filtered on the secondary table columns) because those calls go to all nodes to read the content from the main table, and have to seek in many places. I'm using this secondary table as a secondary index and it is important not to add another level of fan-out. This table should be scanned on its primary key and filter as much as possible.

Now as I kept the GIN indexes, let's see what would happen:

yugabyte=# explain analyze
           select *
            from posts_by_user
            where created_date > now() - interval '1 month'
              and tag_ids @>'{1}'
order by created_date desc limit 100;
                                                                                                                                                     QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------
------
 Limit  (cost=52.11..52.36 rows=100 width=120) (actual time=289549.350..289549.380 rows=100 loops=1)
   ->  Sort  (cost=52.11..54.61 rows=1000 width=120) (actual time=289549.348..289549.363 rows=100 loops=1)
         Sort Key: created_date DESC
         Sort Method: top-N heapsort  Memory: 128kB
         ->  Index Scan using posts_by_user_tag_ids on posts_by_user  (cost=4.00..13.89 rows=1000 width=120) (actual time=8720.517..289510.325 rows=250376 loo
ps=1)
               Index Cond: (tag_ids @> '{1}'::bigint[])
               Filter: (created_date > (now() - '1 mon'::interval))
               Rows Removed by Filter: 2845624
 Planning Time: 0.093 ms
 Execution Time: 289550.383 ms
(10 rows)

You see the problem: because the index does do not contain the created_date, this has to be filtered out later (Rows Removed by Filter). Does it matter? Test it with your use case and your data. And also with your YugabyteDB version. I'm on 2.11 here and maybe one day the GIN index defined can filter on the date. Because, being part of the primary key, it is possible to get it from the index entry. This is an advantage of YugabyteDB storing the tables in the primary key LSM tree: the secondary index reference the row by they primary key. The possibility to use it to filter may be related to #10169. On standard PostgreSQL you will see another join method for this one, but let's keep it for the next post.

Distributed SQL database considerations

Joining the two tables, by reading the one for tag or group first, and then fan-out read to all nodes to get the related posts, is still scalable when calls to the distributed storage are optimized by the SQL query layer. Here YugabyteDB, provides fully consistent global indexes, tables stored in the primary index, triggers to guarantee data integrity beyond what the declarative constraints can provide, procedures to batch many statements in one call... The same technique can be used with PostgreSQL, with additional heap fetches, but no cross-server calls. But maybe you don't need it because Bitmap Scan on GIN index may provide an acceptable response time. I'll run the same data and query on PostgreSQL in a next post.

I hope this use case illustrates a reason for triggers and stored procedures, even when you don't want any business logic in the database. This is only data logic for data integrity and performance reason, and this belongs to the SQL database.

47