15
Text Search example with the OMDB sample database
This post exposes some basic examples for one of the greatest feature of PostgreSQL: text search. In standard SQL, we can use LIKE, SIMILAR and REGEXP. But general text search cannot be optimized with simple B-Tree indexes on the column value. Text contain words, and indexing the text as a whole is not sufficient. Fortunately, PostgreSQL provides many index types, and one of them is GIN - Generalized Inverted Index. We can index the words, automatically, with functions to extract them as "tsvector" - Text Search vectors.
If you are an advanced user of PostgreSQL, you will probably not learn new things here. Except if you want to see how I use it, with views and stored function to encapsulate the functions. If you are a user of Oracle or SQL Server, you know the idea but may be surprised by how it is easy to use in an Open Source database. If you are a user of ElasticSearch, you may see that for simple searches, SQL databases can provide this without an additional service.
My goal here is to show that we can use the same on the latest version of Yugabyte (I'm using 2.11 there). YugabyteDB is a distributed SQL database that reuses the PostgreSQL query layer, which means that many features come without additional effort. However, the distributed storage is different from the monolithic postgres, using LSM Tree instead of B-Tree and Heap tables. The YugabyteDB YBGIN is similar to YugabyteDB GIN, but implemented on top of LSM Tree indexes.
In PostgreSQL, here is how you define an HEAP table and a GIN index:
postgres=# create table demo
(id bigint primary key, description text)
USING HEAP;
CREATE TABLE
postgres=# create index demo_index on demo
( length(description) );
CREATE INDEX
postgres=# create index demo_gin on demo
USING GIN
( (to_tsvector('simple',description)) );
CREATE INDEX
postgres=# select relname,reltype,amname
from pg_class left outer join pg_am
on pg_class.relam=pg_am.oid
where relname like 'demo%';
relname | reltype | amname
------------+---------+--------
demo | 16918 | heap
demo_gin | 0 | gin
demo_index | 0 | btree
demo_pkey | 0 | btree
(4 rows)
The storage type defined with USING
is visible as Access Method, which is the the name of the extensibility layer for different storage. The default for PostgreSQL is HEAP tables, BTREE indexes, and GIN can be used for text search.
I've precised simple
text search configuration. You need to specify id to get a deterministic result. If not, you will get: ERROR: functions in index expression must be marked IMMUTABLE
.
In YugabyteDB the default is LSM Tree for the indexes and table (which is stored clustered on the primary key):
yugabyte=# create table demo
(id bigint primary key, description text)
;
CREATE TABLE
yugabyte=# create index demo_index on demo
( length(description) );
CREATE INDEX
yugabyte=# create index demo_gin on demo
USING GIN
( (to_tsvector('simple',description)) );
NOTICE: replacing access method "gin" with "ybgin"
CREATE INDEX
yugabyte=# select relname,reltype,amname
from pg_class left outer join pg_am
on pg_class.relam=pg_am.oid
where relname like 'demo%';
relname | reltype | amname
------------+---------+--------
demo | 16805 |
demo_gin | 0 | ybgin
demo_index | 0 | lsm
demo_pkey | 0 | lsm
(4 rows)
A few comments on the differences with PostgreSQL.
- there's no USING clause for the table because I'm using YB-2.11 which is PG11 compatible and table access methods came in PG12. All non-temporary tables in YugabyteDB use the distributed storage (LSM Tree).
- the primary key, which is physically the same as the table, shows LSM.
- regular secondary indexes are also LSM Trees
- the
USING GIN
clause is transformed toUSING YBGIN
, which is the YugabyteDB implementation of it.
I'll load a sample database with many text columns: OMDB (open media database) - a free database for film media.
The https://github.com/credativ/omdb-postgresql
project has a procedure to load into PostgreSQL. I can use the same to load into YugabyteDB, but this is not optimal. It is better to define the PRIMARY in the CREATE TABLE statement rather than ALTER TABLE later. What I did was load into PostgreSQL in order to export with pg_dump:
git clone https://github.com/credativ/omdb-postgresql.git
cd omdb-postgresql
./download
./import
pg_dump -f omdb.sql omdb
I have a quick script to move the PRIMARY KEY declaration into the CREATE TABLE:
awk '
/^ALTER TABLE ONLY/{last_alter_table=$NF}
/^ *ADD CONSTRAINT .* PRIMARY KEY /{sub(/ADD /,"");sub(/;$/,"");pk[last_alter_table]=$0",";$0=$0"\\r"}
NR > FNR && /^CREATE TABLE/{ print $0,pk[$3] > "omdb-pk.sql" ; next} NR > FNR { print > "omdb-pk.sql" }
' omdb.sql omdb.sql
A better solution would be using yb_dump from the YugabyteDB installation, but I want to make this post simple and the same with PostgreSQL and YugabyteDB.
To simplify queries, I create the following view that joins the movies
with movies_abstract_en
and adds the text search vector of words from the abstract (to_tsvector('simple',abstract)
):
create or replace view movies_search as
select *
from (
select id as movie_id, name as movie_name from movies
) movies
natural join
(
select movie_id, abstract as movie_abstract,
to_tsvector('simple',abstract) as movie_abstract_ts
from movie_abstracts_en
) movie_abstracts;
I also create a function to encapsulate the call to websearch_to_tsquery
in order to query it with a google-like syntax:
create or replace function find_movie(text)
returns setof movies_search as $sql$
select * from movies_search
where websearch_to_tsquery('simple',$1)
@@ movie_abstract_ts;
$sql$ language sql;
Here is a simple example of usage:
yugabyte=# select * from find_movie(
'Luke and Leia and "George Lucas"
');
movie_id|movie_name |movie_abstract |movie_abstract_ts |
-------------+----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1891|Star Wars: Episode V - The Empire Strikes Back|The Empire Strikes Back is considered the most morally and emotionally complex of the original Star Wars trilogy, continuing creator George Lucas's epic saga where Star Wars: Episode IV - A New Hope left off. Masterful storytelling weaves together multipl|'a':31 'against':46 'and':10,48,59 'archetypal':41 'as':50 'attempts':53 'back':4 'capture':55 'complex':12 'considered':6 'continuing':19 'creator':20 'desperately':52 'emotionally':11 'empire':2 'epic':24 'episode':29 'for':57 'george':21 'han':47 'he':|
10|Star Wars |A series of six films from the Director, Screenwriter and Producer George Lucas. Luke Skywalker, Princes Leia, Darth Vader, C3PO, R2D2 and many other characters from the film are now house hold names from one of the most successful film projects of all ti|'a':1 'all':43 'and':10,22 'are':29 'c3po':20 'characters':25 'darth':18 'director':8 'film':28,40 'films':5 'from':6,26,34 'george':12 'hold':32 'house':31 'leia':17 'lucas':13 'luke':14 'many':23 'most':38 'names':33 'now':30 'of':3,36,42 'one':35 'othe|
11|Star Wars: Episode IV – A New Hope |A New Hope was the first Star Wars film from the director, screenwriter, and producer George Lucas, although it is the fourth episode in the series of six. Luke Skywalker, Princes Leia, Darth Vader, C3PO, R2D2 and many other characters from the film are n|'a':1 'all':59 'although':18 'and':14,37 'are':44 'c3po':35 'characters':40 'darth':33 'director':12 'episode':23 'film':9,43,56 'first':6 'fourth':22 'from':10,41,50 'george':16 'hold':48 'hope':3 'house':47 'house-hold-names':46 'in':24 'is':20 'it':19 |
You can see on the last column the text search vector of words, with their position, that is used by the @@ function. And the text search query generated from the google-like syntax is:
yugabyte=# select websearch_to_tsquery('simple',
'Luke and Leia and "George Lucas"'
);
websearch_to_tsquery
-------------------------------------------------------------
'luke' & 'and' & 'leia' & 'and' & 'george' <-> 'lucas'
(1 row)
This Text Search syntax is powerful (<->
is for consecutive words and you can use <3>
to accept them spaced by 2 words). The goal of this post is not to go into all details as it is documented in many places. For common searches, the websearch_to_tsquery
syntax is probably easier.
This looks good except that the underlying query has to scan the whole table:
yugabyte=# explain (analyze, verbose)
select * from movies_search where
websearch_to_tsquery('simple',
'Luke and Leia and "George Lucas"'
) @@ movie_abstract_ts;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..716.39 rows=1000 width=104) (actual time=20.359..79.051 rows=3 loops=1)
Output: movies.id, movies.name, movie_abstracts_en.abstract, to_tsvector('simple'::regconfig, movie_abstracts_en.abstract)
Inner Unique: true
-> Seq Scan on public.movie_abstracts_en (cost=0.00..352.50 rows=1000 width=40) (actual time=19.203..76.525 rows=3 loops=1)
Output: movie_abstracts_en.movie_id, movie_abstracts_en.abstract
Filter: ('''luke'' & ''and'' & ''leia'' & ''and'' & ''george'' <-> ''lucas'''::tsquery @@ to_tsvector('simple'::regconfig, movie_abstracts_en.abstract))
Rows Removed by Filter: 2683
-> Index Scan using movies_pkey on public.movies (cost=0.00..0.11 rows=1 width=40) (actual time=0.766..0.766 rows=1 loops=3)
Output: movies.id, movies.name, movies.parent_id, movies.date, movies.series_id, movies.kind, movies.runtime, movies.budget, movies.revenue, movies.homepage, movies.vote_average, movies.votes_count
Index Cond: (movies.id = movie_abstracts_en.movie_id)
Planning Time: 2.722 ms
Execution Time: 79.153 ms
(12 rows)
The ::tsquery @@ to_tsvector
condition is a Filter after Seq Scan
. A regular index would not help. This is where GIN indexes come into play as they can index the array of words.
I can create an index on the function behind "movie_abstract_ts". The GIN index will have entries for each array element, which are words when the text is parsed by to_tsvector()
yugabyte=# create index movie_abstracts_en_ts_vector
on movie_abstracts_en
using ybgin (
( to_tsvector('pg_catalog.simple',abstract) )
);
CREATE INDEX
Now the same EXPLAIN shows the filter in Index Cond
:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=24.00..395.90 rows=1000 width=104) (actual time=3.729..5.030 rows=3 loops=1)
Output: movies.id, movies.name, movie_abstracts_en.abstract, to_tsvector('simple'::regconfig, movie_abstracts_en.abstract)
Inner Unique: true
-> Index Scan using movie_abstracts_en_ts_vector on public.movie_abstracts_en (cost=24.00..32.01 rows=1000 width=40) (actual time=3.215..3.486 rows=3 loops=1)
Output: movie_abstracts_en.movie_id, movie_abstracts_en.abstract
Index Cond: ('''luke'' & ''and'' & ''leia'' & ''and'' & ''george'' <-> ''lucas'''::tsquery @@ to_tsvector('simple'::regconfig, movie_abstracts_en.abstract))
Rows Removed by Index Recheck: 7
-> Index Scan using movies_pkey on public.movies (cost=0.00..0.11 rows=1 width=40) (actual time=0.453..0.453 rows=1 loops=3)
Output: movies.id, movies.name, movies.parent_id, movies.date, movies.series_id, movies.kind, movies.runtime, movies.budget, movies.revenue, movies.homepage, movies.vote_average, movies.votes_count
Index Cond: (movies.id = movie_abstracts_en.movie_id)
Planning Time: 4.134 ms
Execution Time: 5.290 ms
(12 rows)
The GIN index is used to filter-out most of the non-matching rows, but there can be some false positives. Here, from the 2686 rows in movie_abstracts_en
10 have been selected by the index and filtered further (Rows Removed by Index Recheck: 7
) down to the resulting 3 (rows=3
)
This was a quick introduction to Text Search in PostgreSQL and YugabyteDB. Many OLTP application require some free search on a few text columns. This is very simple to do within a PostgreSQL compatible database. And with a distributed SQL database, like YugabyteDB, it scales out, like the specialized text search services, because the table and the index are sharded and replicated into multiple nodes.
@franck_sidi @planetpostgres Yes, sure. I didn't want to make my @OracleDatabase followers too jealous immediately 😂08:06 AM - 30 Nov 2021
create index demo_trgm on movies using gin (name gin_trgm_ops);
select name from movies where name like '%hiker%';
15