Text Search example with the OMDB sample database

This post exposes some basic examples for one of the greatest feature of PostgreSQL: text search. In standard SQL, we can use LIKE, SIMILAR and REGEXP. But general text search cannot be optimized with simple B-Tree indexes on the column value. Text contain words, and indexing the text as a whole is not sufficient. Fortunately, PostgreSQL provides many index types, and one of them is GIN - Generalized Inverted Index. We can index the words, automatically, with functions to extract them as "tsvector" - Text Search vectors.

If you are an advanced user of PostgreSQL, you will probably not learn new things here. Except if you want to see how I use it, with views and stored function to encapsulate the functions. If you are a user of Oracle or SQL Server, you know the idea but may be surprised by how it is easy to use in an Open Source database. If you are a user of ElasticSearch, you may see that for simple searches, SQL databases can provide this without an additional service.

My goal here is to show that we can use the same on the latest version of Yugabyte (I'm using 2.11 there). YugabyteDB is a distributed SQL database that reuses the PostgreSQL query layer, which means that many features come without additional effort. However, the distributed storage is different from the monolithic postgres, using LSM Tree instead of B-Tree and Heap tables. The YugabyteDB YBGIN is similar to YugabyteDB GIN, but implemented on top of LSM Tree indexes.

PostgreSQL: HEAP, BTREE and GIN

In PostgreSQL, here is how you define an HEAP table and a GIN index:

postgres=# create table demo
           (id bigint primary key, description text)
           USING HEAP;
CREATE TABLE

postgres=# create index demo_index on demo
           ( length(description) );
CREATE INDEX

postgres=# create index demo_gin on demo
           USING GIN
           ( (to_tsvector('simple',description)) );
CREATE INDEX

postgres=# select relname,reltype,amname 
           from pg_class left outer join pg_am
           on pg_class.relam=pg_am.oid
           where relname like 'demo%';

  relname   | reltype | amname
------------+---------+--------
 demo       |   16918 | heap
 demo_gin   |       0 | gin
 demo_index |       0 | btree
 demo_pkey  |       0 | btree
(4 rows)

The storage type defined with USING is visible as Access Method, which is the the name of the extensibility layer for different storage. The default for PostgreSQL is HEAP tables, BTREE indexes, and GIN can be used for text search.

I've precised simple text search configuration. You need to specify id to get a deterministic result. If not, you will get: ERROR: functions in index expression must be marked IMMUTABLE.

In YugabyteDB the default is LSM Tree for the indexes and table (which is stored clustered on the primary key):

yugabyte=# create table demo
           (id bigint primary key, description text)
           ;
CREATE TABLE

yugabyte=# create index demo_index on demo
           ( length(description) );
CREATE INDEX

yugabyte=# create index demo_gin on demo
           USING GIN
           ( (to_tsvector('simple',description)) );
NOTICE:  replacing access method "gin" with "ybgin"
CREATE INDEX

yugabyte=# select relname,reltype,amname 
           from pg_class left outer join pg_am
           on pg_class.relam=pg_am.oid
           where relname like 'demo%';

  relname   | reltype | amname
------------+---------+--------
 demo       |   16805 |
 demo_gin   |       0 | ybgin
 demo_index |       0 | lsm
 demo_pkey  |       0 | lsm
(4 rows)

A few comments on the differences with PostgreSQL.

  • there's no USING clause for the table because I'm using YB-2.11 which is PG11 compatible and table access methods came in PG12. All non-temporary tables in YugabyteDB use the distributed storage (LSM Tree).
  • the primary key, which is physically the same as the table, shows LSM.
  • regular secondary indexes are also LSM Trees
  • the USING GIN clause is transformed to USING YBGIN, which is the YugabyteDB implementation of it.

OMDB sample database

I'll load a sample database with many text columns: OMDB (open media database) - a free database for film media.

The https://github.com/credativ/omdb-postgresql project has a procedure to load into PostgreSQL. I can use the same to load into YugabyteDB, but this is not optimal. It is better to define the PRIMARY in the CREATE TABLE statement rather than ALTER TABLE later. What I did was load into PostgreSQL in order to export with pg_dump:

git clone https://github.com/credativ/omdb-postgresql.git
cd omdb-postgresql
./download
./import
pg_dump -f omdb.sql omdb

I have a quick script to move the PRIMARY KEY declaration into the CREATE TABLE:

awk '
/^ALTER TABLE ONLY/{last_alter_table=$NF}
/^ *ADD CONSTRAINT .* PRIMARY KEY /{sub(/ADD /,"");sub(/;$/,"");pk[last_alter_table]=$0",";$0=$0"\\r"}
NR > FNR && /^CREATE TABLE/{ print $0,pk[$3] > "omdb-pk.sql" ; next} NR > FNR { print > "omdb-pk.sql" }
' omdb.sql omdb.sql

A better solution would be using yb_dump from the YugabyteDB installation, but I want to make this post simple and the same with PostgreSQL and YugabyteDB.

Google-like Text Search Queries

To simplify queries, I create the following view that joins the movies with movies_abstract_en and adds the text search vector of words from the abstract (to_tsvector('simple',abstract)):

create or replace view movies_search as 
 select *
 from ( 
    select id as movie_id, name as movie_name from movies
 ) movies 
 natural join 
 (
    select movie_id, abstract as movie_abstract,
     to_tsvector('simple',abstract) as movie_abstract_ts
    from movie_abstracts_en
 ) movie_abstracts;

I also create a function to encapsulate the call to websearch_to_tsquery in order to query it with a google-like syntax:

create or replace function find_movie(text)
returns setof movies_search as $sql$
 select * from movies_search 
 where websearch_to_tsquery('simple',$1) 
 @@ movie_abstract_ts;
$sql$ language sql;

Here is a simple example of usage:

yugabyte=#  select * from find_movie(
            'Luke and Leia and "George Lucas"
            ');


movie_id|movie_name                                    |movie_abstract                                                                                                                                                                                                                                                 |movie_abstract_ts                                                                                                                                                                                                                                              |
-------------+----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1891|Star Wars: Episode V - The Empire Strikes Back|The Empire Strikes Back is considered the most morally and emotionally complex of the original Star Wars trilogy, continuing creator George Lucas's epic saga where Star Wars: Episode IV - A New Hope left off. Masterful storytelling weaves together multipl|'a':31 'against':46 'and':10,48,59 'archetypal':41 'as':50 'attempts':53 'back':4 'capture':55 'complex':12 'considered':6 'continuing':19 'creator':20 'desperately':52 'emotionally':11 'empire':2 'epic':24 'episode':29 'for':57 'george':21 'han':47 'he':|
      10|Star Wars                                     |A series of six films from the Director, Screenwriter and Producer George Lucas. Luke Skywalker, Princes Leia, Darth Vader, C3PO, R2D2 and many other characters from the film are now house hold names from one of the most successful film projects of all ti|'a':1 'all':43 'and':10,22 'are':29 'c3po':20 'characters':25 'darth':18 'director':8 'film':28,40 'films':5 'from':6,26,34 'george':12 'hold':32 'house':31 'leia':17 'lucas':13 'luke':14 'many':23 'most':38 'names':33 'now':30 'of':3,36,42 'one':35 'othe|
      11|Star Wars: Episode IV – A New Hope            |A New Hope was the first Star Wars film from the director, screenwriter, and producer George Lucas, although it is the fourth episode in the series of six. Luke Skywalker, Princes Leia, Darth Vader, C3PO, R2D2 and many other characters from the film are n|'a':1 'all':59 'although':18 'and':14,37 'are':44 'c3po':35 'characters':40 'darth':33 'director':12 'episode':23 'film':9,43,56 'first':6 'fourth':22 'from':10,41,50 'george':16 'hold':48 'hope':3 'house':47 'house-hold-names':46 'in':24 'is':20 'it':19 |

You can see on the last column the text search vector of words, with their position, that is used by the @@ function. And the text search query generated from the google-like syntax is:

yugabyte=#  select websearch_to_tsquery('simple',
            'Luke and Leia and "George Lucas"'
            );

                  websearch_to_tsquery
-------------------------------------------------------------
 'luke' & 'and' & 'leia' & 'and' & 'george' <-> 'lucas'
(1 row)

This Text Search syntax is powerful (<-> is for consecutive words and you can use <3> to accept them spaced by 2 words). The goal of this post is not to go into all details as it is documented in many places. For common searches, the websearch_to_tsquery syntax is probably easier.

This looks good except that the underlying query has to scan the whole table:

yugabyte=#  explain (analyze, verbose) 
            select * from movies_search  where
            websearch_to_tsquery('simple',
            'Luke and Leia and "George Lucas"'
            ) @@ movie_abstract_ts;

                                                                                                                                                                                                                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..716.39 rows=1000 width=104) (actual time=20.359..79.051 rows=3 loops=1)
   Output: movies.id, movies.name, movie_abstracts_en.abstract, to_tsvector('simple'::regconfig, movie_abstracts_en.abstract)
   Inner Unique: true
   ->  Seq Scan on public.movie_abstracts_en  (cost=0.00..352.50 rows=1000 width=40) (actual time=19.203..76.525 rows=3 loops=1)
         Output: movie_abstracts_en.movie_id, movie_abstracts_en.abstract
         Filter: ('''luke'' & ''and'' & ''leia'' & ''and'' & ''george'' <-> ''lucas'''::tsquery @@ to_tsvector('simple'::regconfig, movie_abstracts_en.abstract))
         Rows Removed by Filter: 2683
   ->  Index Scan using movies_pkey on public.movies  (cost=0.00..0.11 rows=1 width=40) (actual time=0.766..0.766 rows=1 loops=3)
         Output: movies.id, movies.name, movies.parent_id, movies.date, movies.series_id, movies.kind, movies.runtime, movies.budget, movies.revenue, movies.homepage, movies.vote_average, movies.votes_count
         Index Cond: (movies.id = movie_abstracts_en.movie_id)
 Planning Time: 2.722 ms
 Execution Time: 79.153 ms
(12 rows)

The ::tsquery @@ to_tsvector condition is a Filter after Seq Scan. A regular index would not help. This is where GIN indexes come into play as they can index the array of words.

YBGIN, the GIN index in YugabyteDB

I can create an index on the function behind "movie_abstract_ts". The GIN index will have entries for each array element, which are words when the text is parsed by to_tsvector()

yugabyte=#  create index movie_abstracts_en_ts_vector 
            on movie_abstracts_en 
            using ybgin ( 
            ( to_tsvector('pg_catalog.simple',abstract) ) 
            );
CREATE INDEX

Now the same EXPLAIN shows the filter in Index Cond:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=24.00..395.90 rows=1000 width=104) (actual time=3.729..5.030 rows=3 loops=1)
   Output: movies.id, movies.name, movie_abstracts_en.abstract, to_tsvector('simple'::regconfig, movie_abstracts_en.abstract)
   Inner Unique: true
   ->  Index Scan using movie_abstracts_en_ts_vector on public.movie_abstracts_en  (cost=24.00..32.01 rows=1000 width=40) (actual time=3.215..3.486 rows=3 loops=1)
         Output: movie_abstracts_en.movie_id, movie_abstracts_en.abstract
         Index Cond: ('''luke'' & ''and'' & ''leia'' & ''and'' & ''george'' <-> ''lucas'''::tsquery @@ to_tsvector('simple'::regconfig, movie_abstracts_en.abstract))
         Rows Removed by Index Recheck: 7
   ->  Index Scan using movies_pkey on public.movies  (cost=0.00..0.11 rows=1 width=40) (actual time=0.453..0.453 rows=1 loops=3)
         Output: movies.id, movies.name, movies.parent_id, movies.date, movies.series_id, movies.kind, movies.runtime, movies.budget, movies.revenue, movies.homepage, movies.vote_average, movies.votes_count
         Index Cond: (movies.id = movie_abstracts_en.movie_id)
 Planning Time: 4.134 ms
 Execution Time: 5.290 ms
(12 rows)

The GIN index is used to filter-out most of the non-matching rows, but there can be some false positives. Here, from the 2686 rows in movie_abstracts_en 10 have been selected by the index and filtered further (Rows Removed by Index Recheck: 7) down to the resulting 3 (rows=3)

This was a quick introduction to Text Search in PostgreSQL and YugabyteDB. Many OLTP application require some free search on a few text columns. This is very simple to do within a PostgreSQL compatible database. And with a distributed SQL database, like YugabyteDB, it scales out, like the specialized text search services, because the table and the index are sharded and replicated into multiple nodes.

Addendum to answer a question on Twitter:

create index demo_trgm on movies using gin (name gin_trgm_ops);
select name from movies where name like '%hiker%';

11