21
SQL to derive subsequences
This is an example of using SQL to derive subsequences according to when some value in a table changes or stays the same.
It assumes:
- that we have a source table of
DbsNm.TblNm
; - that there is some value (
ImportantSequentialValue
) that always indicates which records come before others ; - that there is one or more category columns - given here as two columns:
Category_A
&Category_B
; - that there is a column or an expression which is the thing that variously changes or stays the same:
ExprValuThatMayChange
.
The rough idea is to use multiple ROW_NUMBER calls:
- one for the whole category ;
- one for the changes within the category
- another one for each sequential instance of change within the category
and then use the differences between these to generate a unique ID for each subsequence.
The SQL I'm giving here derives identifiers for the subsequences - as ArbitraryChangeGroupId
- and then uses that for an overall summary of them and their boundary points in a final SELECT and GROUP BY.
There is a display of example data further below.
With that preamble, here is the full example SQL:
WITH
Round_1 AS (
SELECT
SrcTbl.* ,
( ExprValuThatMayChange ) AS SubsetMarker ,
( -- Derive CounterPerCategory
ROW_NUMBER() OVER (
PARTITION BY
Category_A ,
Category_B
ORDER BY
ImportantSequentialValue )
) AS CounterPerCategory ,
( -- Derive CounterPerSubsetCategory
ROW_NUMBER() OVER (
PARTITION BY
Category_A ,
Category_B ,
SubsetMarker
ORDER BY
ImportantSequentialValue )
) AS CounterPerSubsetCategory ,
( -- Derive SubGrpNum
CounterPerCategory
-
CounterPerSubsetCategory
) AS SubGrpNum
FROM
DbsNm.TblNm AS SrcTbl
) ,
Round_2 AS (
SELECT
Round_1.* ,
( -- Derive ResetCounterPerCategorySubset
ROW_NUMBER() OVER (
PARTITION BY
Category_A ,
Category_B ,
SubGrpNum
ORDER BY
ImportantSequentialValue )
) AS ResetCounterPerCategorySubset ,
( -- Derive ArbitraryChangeGroupId
CounterPerSubsetCategory
-
ResetCounterPerCategorySubset
) AS ArbitraryChangeGroupId
FROM
Round_1
)
-- Example summary by the Arbitrary Change Groups SELECT
SELECT
Category_A ,
Category_B ,
ArbitraryChangeGroupId ,
MIN( ImportantSequentialValue ) AS SubGrp_MinSeqVal ,
MAX( ImportantSequentialValue ) AS SubGrp_MaxSeqVal ,
COUNT(*) AS SubGrp_Rw_Cnt
FROM
Round_2
GROUP BY
Category_A ,
Category_B ,
ArbitraryChangeGroupId
ORDER BY
SubGrp_MinSeqVal
;
And here are all the partial steps plotted out to show quite why it works.
- note that I've separated out, that which becomes column G, as E & F so you can see how the values in G come about;
- for simplicity I've given just one "category" column and we're only looking at one value in it.
Cat | Seq | Valu | D | E | F | G | H | I | J |
---|---|---|---|---|---|---|---|---|---|
Eric | ab | This | 1 | 1 | 1 | 0 | 1 | 0 | |
Eric | bc | This | 2 | 2 | 2 | 0 | 2 | 0 | |
Eric | cd | That | 3 | 1 | 1 | 2 | 1 | 2 | |
Eric | de | That | 4 | 2 | 2 | 2 | 2 | 2 | |
Eric | ef | That | 5 | 3 | 3 | 2 | 3 | 2 | |
Eric | fg | This | 6 | 3 | 3 | 3 | 1 | 5 | |
Eric | gh | This | 7 | 4 | 4 | 3 | 2 | 5 | |
Eric | hi | That | 8 | 4 | 4 | 4 | 1 | 7 | |
Eric | jk | This | 9 | 5 | 5 | 4 | 1 | 8 | |
Eric | kl | This | 10 | 6 | 6 | 4 | 2 | 8 | |
Eric | lm | This | 11 | 7 | 7 | 4 | 3 | 8 |
About those columns:
Column | In source | Meaning or derivation | Calc |
---|---|---|---|
Cat | Y | Category A | one or more such columns |
Seq | Y | ImportantSequentialValue | is often a timestamp |
Valu | Y | ExprValuThatMayChange | |
D | - | CounterPerCategory | row# |
E | - | CounterPerSubsetCategory for This | see G |
F | - | CounterPerSubsetCategory for That | see G |
G | - | CounterPerSubsetCategory for Either | row# |
H | - | SubGrpNum | D minus G |
I | - | ResetCounterPerCategorySubset | row# |
J | - | ArbitraryChangeGroupId | D minus I |
You will see that the ArbitraryChangeGroupId
values are not contiguous, but that for this purpose, that simply doesn't matter.
Do note that in the SQL I've separated out all the steps so that each partial result is present as a named column. In practice, only some of those need to be made explicit, the others can just exist as expressions within the derivations.
How it is best to do that will vary with the query engine being used and its approaches to optimisation. Some will just work around the named columns anyway, conversely some will become inefficient if unable to recognise that identical uses of ROW_NUMBER are happening. As the saying goes: your mileage may vary.
In my work, I have an SQL script processing tool that lets me store the script as a pro forma and then have text substitutions made at run time. Thus I can affirm that the above SQL really works, as I used it via such substitutions for an actual run on real data.
I originally wrote a version of this method a great many years ago. Of course, old hacks like me tend to say that kind of thing, and I probably do have the code somewhere in my archive.
I can't quite remember whether I had access to ROW_NUMBER as a window function at the time, or had to derive something similar the hard way. I can say that the dialect I was using then didn't support CTE syntax, so I would have entirely used the derived table syntax.
Recently the problem of doing this kind of thing came up when I was asked for help out with some analysis. While I knew I had a method somewhere in my past, I couldn't quite remember how it went.
So, like anyone, I net searched and it was this posting that reminded me of the method.
Resetting Row number according to record data change
That only covers the reset counter aspect but it was enough to jog my memory for me to write out and test the full solution - so I'd like to give the link here in appreciation.
One of the test substitutions that I ran (see Notes section) was to apply the method to the Teradata System Calendar. Therefore this is only of use to those to those using a Teradata, but hey, I've no reason not to share this too. It was interesting coming up with a simple made-up category for the "change" values.
WITH
Round_1 AS (
SELECT
SrcTbl.* ,
( ( day_of_week + day_of_month + day_of_year + day_of_calendar + weekday_of_month ) MOD 2 ) AS SubsetMarker ,
( -- Derive CounterPerCategory
ROW_NUMBER() OVER (
PARTITION BY
year_of_calendar ,
quarter_of_calendar
ORDER BY
calendar_date )
) AS CounterPerCategory ,
( -- Derive CounterPerSubsetCategory
ROW_NUMBER() OVER (
PARTITION BY
year_of_calendar ,
quarter_of_calendar ,
SubsetMarker
ORDER BY
calendar_date )
) AS CounterPerSubsetCategory ,
( -- Derive SubGrpNum
CounterPerCategory
-
CounterPerSubsetCategory
) AS SubGrpNum
FROM
sys_calendar.calendar AS SrcTbl
) ,
Round_2 AS (
SELECT
Round_1.* ,
( -- Derive
ROW_NUMBER() OVER (
PARTITION BY
year_of_calendar ,
quarter_of_calendar ,
SubGrpNum
ORDER BY
calendar_date )
) AS ResetCounterPerCategorySubset ,
( CounterPerSubsetCategory
-
ResetCounterPerCategorySubset
) AS ArbitraryChangeGroupId
FROM
Round_1
)
-- Example summary by the Arbitrary Change Groups SELECT
SELECT
year_of_calendar ,
quarter_of_calendar ,
ArbitraryChangeGroupId ,
MIN( calendar_date ) AS SubGrp_MinSeqVal ,
MAX( calendar_date ) AS SubGrp_MaxSeqVal ,
COUNT(*) AS SubGrp_Rw_Cnt
FROM
Round_2
GROUP BY
year_of_calendar ,
quarter_of_calendar ,
ArbitraryChangeGroupId
ORDER BY
SubGrp_MinSeqVal
;
21