Learn how to write SQL recursive CTE in 5 steps

You don't need a Graph database to follow a hierarchy of nodes. SQL can do joins, self-join, and even joins to its previous result, iteratively. This WITH RECURSIVE clause is often called "Recursive CTE". This might require a bit of abstraction for the developers used to procedural languages. Here is how I explain it by joining "manually" the two first levels, without recursion, and then switching to "recursive" to go further. Starting with the first two levels helps to validate the logic easily.

(0) Example: create the EMP table

I'll create the oldest sample table: EMP for employees with a self-reference to the manager:

CREATE TABLE emp (empno, ename,    job,        mgr,   hiredate,     sal, comm, deptno, email, other_info) as values
(7369, 'SMITH',  'CLERK',     7902, '1980-12-17',  800, NULL,   20,'[email protected]', '{"skills":["accounting"]}'),
(7499, 'ALLEN',  'SALESMAN',  7698, '1981-02-20', 1600,  300,   30,'[email protected]', null),
(7521, 'WARD',   'SALESMAN',  7698, '1981-02-22', 1250,  500,   30,'[email protected]', null),
(7566, 'JONES',  'MANAGER',   7839, '1981-04-02', 2975, NULL,   20,'[email protected]', null),
(7654, 'MARTIN', 'SALESMAN',  7698, '1981-09-28', 1250, 1400,   30,'[email protected]', null),
(7698, 'BLAKE',  'MANAGER',   7839, '1981-05-01', 2850, NULL,   30,'[email protected]', null),
(7782, 'CLARK',  'MANAGER',   7839, '1981-06-09', 2450, NULL,   10,'[email protected]', '{"skills":["C","C++","SQL"]}'),
(7788, 'SCOTT',  'ANALYST',   7566, '1982-12-09', 3000, NULL,   20,'[email protected]', '{"cat":"tiger"}'),
(7839, 'KING',   'PRESIDENT', NULL, '1981-11-17', 5000, NULL,   10,'[email protected]', null),
(7844, 'TURNER', 'SALESMAN',  7698, '1981-09-08', 1500,    0,   30,'[email protected]', null),
(7876, 'ADAMS',  'CLERK',     7788, '1983-01-12', 1100, NULL,   20,'[email protected]', null),
(7900, 'JAMES',  'CLERK',     7698, '1981-12-03',  950, NULL,   30,'[email protected]', null),
(7902, 'FORD',   'ANALYST',   7566, '1981-12-03', 3000, NULL,   20,'[email protected]', '{"skills":["SQL","CQL"]}'),
(7934, 'MILLER', 'CLERK',     7782, '1982-01-23', 1300, NULL,   10,'[email protected]', null)
;

The self reference from mgr to empno can be declared as:

ALTER TABLE emp ADD PRIMARY KEY (empno)
;
ALTER TABLE emp ADD FOREIGN KEY (mgr) REFERENCES emp (empno)
;

This is not needed but it ensures the correctness of that that you query so that you can focus on your business logic rather than consistency test.

(1) find the root (level 0)

The first step is to identify the root of the graph. Do you want to start by the employees and show their manager, or start by the mangager and show its employees? I'll do the latter. The root of the hierarchy is the one with no manager (where mgr is null)

-- (1) query to find the root
select emp.empno,emp.ename,emp.mgr
from emp where mgr is null
;

The result is simple:

yugabyte-> ;

 empno | ename | mgr
-------+-------+-----
  7839 | KING  |
(1 row)

yugabyte=>

(2) define this level 0 in the WITH clause

I'll put it in a CTE (Common Table Expression), labelled as level0 and I select from it:

-- (2) define the root as as level 0 in a CTE
with recursive
level0 as (
-- (1) query to find the root
select emp.empno,emp.ename,emp.mgr
from emp where mgr is null
) 
-- (2) query this level
select * from level0
;

The result is the same. The top manager is KING.

yugabyte-> ;

 empno | ename | mgr
-------+-------+-----
  7839 | KING  |
(1 row)

yugabyte=>

I'm just building the blocks one by one. And I've left the comments to make it clear what is taken from the previous step and what is added. I already add RECURSIVE in the WITH clause but for the moment, as there are no self-reference, there will be only one iteration.

(3) define the join to the next level

Now that my level 0 is defined, I query the emp table again, to get the next level, with a join to level 0 to get the manager name:

-- (2) define the root as as level 0 in a CTE
with recursive
level0 as (
-- (1) query to find the root
select emp.empno,emp.ename,emp.mgr
from emp where mgr is null
) 
-- (3) join employees with level 0
select emp.empno,emp.ename,emp.mgr
,mgr.ename mgr_name
from emp
join level0 mgr on mgr.empno=emp.mgr
;

I have labelled the level 0 as mgr because it will be the manager of this new level of employee, for which I keep the table name emp. Then the join on mgr.empno=emp.mgr links the two levels. In the join result, I have added mgr.ename as mgr_name in order to see the name of the manager in the same line.

So here here are the employees referring to KING:

yugabyte-> ;

 empno | ename | mgr  | mgr_name
------------+-------+------+----------
  7698 | BLAKE | 7839 | KING
  7566 | JONES | 7839 | KING
  7782 | CLARK | 7839 | KING
(3 rows)

yugabyte=>

(4) concatenate the two levels with UNION

Showing the manager name was one step, but to show the whole hierarchy, we need to concatenate the rows from all levels. I start with my two first levels. I need a UNION ALL. For this, move my join query in the WITH clause and label it as leveln. I could have named it level1 because it joins with level0 but the goal is to make it an abstraction for the next level. Because I'll UNION ALL the two, they need the same columns. leveln has the manager name as mgr_name so I add this, as an empty string, when querying level0

-- (2) define the root as as level 0 in a CTE
with recursive
level0 as (
-- (1) query to find the root
select emp.empno,emp.ename,emp.mgr
from emp where mgr is null
),
-- (4) put the second level in the WITH clause
leveln as (
-- (3) join employees with level 0
select emp.empno,emp.ename,emp.mgr
,mgr.ename mgr_name
from emp
join level0 mgr on mgr.empno=emp.mgr
)
-- (4) concatenate the two levels
select *,'' mgr_name from level0
union all
select *             from leveln
;

This starts to look good, here are the two levels together:

yugabyte-> ;

 empno | ename | mgr  | mgr_name
------------+-------+------+----------
  7839 | KING  |      |
  7698 | BLAKE | 7839 | KING
  7566 | JONES | 7839 | KING
  7782 | CLARK | 7839 | KING
(4 rows)

yugabyte=>

This is where you can validate your result. The next step will be automatic to add recursion.

(5) get it in one recursive WITH clause

Now that I have validated the result for the two first levels, it is time to get it recursive. Instead of joining to level0 I'll join to leveln which is the previous result of itself. And I move the UNION ALL into this leveln CTE.

-- (2) define the root as as level 0 in a CTE
with recursive
level0 as (
-- (1) query to find the root
select emp.empno,emp.ename,emp.mgr
from emp where mgr is null
),
-- (4) put the second level in the WITH clause
leveln as (
-- (5) put the UNION ALL in the recursive level
-- (4) concatenate the two levels
select *,'' mgr_name from level0
union all
-- (3) join employees with level 0
select emp.empno,emp.ename,emp.mgr
,mgr.ename mgr_name
from emp
-- (5) change level0 to leveln to join iteratively
join leveln mgr on mgr.empno=emp.mgr
)
-- (5) query this level n
select *             from leveln
;

If you have validated the join with the first two level, this last transformation is automatic. At execution time, the first iteration will get the first level and the second part of the UNION ALL will have no rows because of the join with the previous result, which is empty. The next iteration will add the result of the join. And again until the join returns no rows because we are at a leaf, with the previous result, taken as a list of managers, has no employees.

yugabyte-> ;

 empno | ename  | mgr  | mgr_name
------------+--------+------+----------
  7839 | KING   |      |
  7698 | BLAKE  | 7839 | KING
  7566 | JONES  | 7839 | KING
  7782 | CLARK  | 7839 | KING
  7788 | SCOTT  | 7566 | JONES
  7902 | FORD   | 7566 | JONES
  7844 | TURNER | 7698 | BLAKE
  7499 | ALLEN  | 7698 | BLAKE
  7521 | WARD   | 7698 | BLAKE
  7654 | MARTIN | 7698 | BLAKE
  7900 | JAMES  | 7698 | BLAKE
  7934 | MILLER | 7782 | CLARK
  7876 | ADAMS  | 7788 | SCOTT
  7369 | SMITH  | 7902 | FORD
(14 rows)

yugabyte=>

This is where the magic is. If you carefully test your first two levels, all others will come by transforming the query to be recursive, iterating by joining on its previous result. Note that I've left the first level as a level0 CTE. You can put it into the leveln but I think it is more readable like this.

In the execution plan, you see it as Recursive union

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on leveln  (cost=20.23..22.86 rows=131 width=72) (actual time=0.023..0.281 rows=14 loops=1)
   CTE leveln
     ->  Recursive Union  (cost=0.00..20.23 rows=131 width=46) (actual time=0.021..0.269 rows=14 loops=1)
           ->  Seq Scan on emp  (cost=0.00..1.14 rows=1 width=46) (actual time=0.019..0.021 rows=1 loops=1)
                 Filter: (mgr IS NULL)
           ->  Hash Join  (cost=0.33..1.65 rows=13 width=46) (actual time=0.054..0.058 rows=3 loops=4)
                 Hash Cond: (emp_1.mgr = mgr.empno)
                 ->  Seq Scan on emp emp_1  (cost=0.00..1.14 rows=14 width=14) (actual time=0.002..0.003 rows=14 loops=4)
                 ->  Hash  (cost=0.20..0.20 rows=10 width=36) (actual time=0.044..0.044 rows=4 loops=4)
                       ->  WorkTable Scan on leveln mgr  (cost=0.00..0.20 rows=10 width=36) (actual time=0.001..0.001 rows=4 loops=4)

The number of loops tells you how many time the second part was executed. All this runs the same in PostgreSQL and YugabyteDB.

23