SQL to derive subsequences

This is an example of using SQL to derive subsequences according to when some value in a table changes or stays the same.
It assumes:
  • that we have a source table of DbsNm.TblNm ;
  • that there is some value ( ImportantSequentialValue ) that always indicates which records come before others ;
  • that there is one or more category columns - given here as two columns: Category_A & Category_B ;
  • that there is a column or an expression which is the thing that variously changes or stays the same: ExprValuThatMayChange .
  • The rough idea is to use multiple ROW_NUMBER calls:
  • one for the whole category ;
  • one for the changes within the category
  • another one for each sequential instance of change within the category
  • and then use the differences between these to generate a unique ID for each subsequence.
    The SQL I'm giving here derives identifiers for the subsequences - as ArbitraryChangeGroupId - and then uses that for an overall summary of them and their boundary points in a final SELECT and GROUP BY.
    There is a display of example data further below.
    With that preamble, here is the full example SQL:
    WITH
    Round_1 AS (
        SELECT
            SrcTbl.* ,
            ( ExprValuThatMayChange ) AS SubsetMarker ,
            ( -- Derive CounterPerCategory
                ROW_NUMBER() OVER (
                    PARTITION BY
                        Category_A ,
                        Category_B
                    ORDER BY
                        ImportantSequentialValue )
                ) AS CounterPerCategory ,
            ( -- Derive CounterPerSubsetCategory
                ROW_NUMBER() OVER (
                    PARTITION BY
                        Category_A ,
                        Category_B ,
                        SubsetMarker
                    ORDER BY
                        ImportantSequentialValue )
                ) AS CounterPerSubsetCategory ,
            ( -- Derive SubGrpNum
                CounterPerCategory
                -
                CounterPerSubsetCategory
                ) AS SubGrpNum
        FROM
            DbsNm.TblNm AS SrcTbl
        ) ,
    Round_2 AS (
        SELECT
            Round_1.* ,
            ( -- Derive ResetCounterPerCategorySubset
                ROW_NUMBER() OVER (
                    PARTITION BY
                        Category_A ,
                        Category_B ,
                        SubGrpNum
                    ORDER BY
                        ImportantSequentialValue )
                ) AS ResetCounterPerCategorySubset ,
            ( -- Derive ArbitraryChangeGroupId
                CounterPerSubsetCategory
                -
                ResetCounterPerCategorySubset
                ) AS ArbitraryChangeGroupId
        FROM
            Round_1
        )
    -- Example summary by the Arbitrary Change Groups SELECT
    SELECT
        Category_A ,
        Category_B ,
        ArbitraryChangeGroupId ,
        MIN( ImportantSequentialValue ) AS SubGrp_MinSeqVal ,
        MAX( ImportantSequentialValue ) AS SubGrp_MaxSeqVal ,
        COUNT(*) AS SubGrp_Rw_Cnt
    FROM
        Round_2
    GROUP BY
        Category_A ,
        Category_B ,
        ArbitraryChangeGroupId
    ORDER BY
        SubGrp_MinSeqVal
    ;
    And here are all the partial steps plotted out to show quite why it works.
  • note that I've separated out, that which becomes column G, as E & F so you can see how the values in G come about;
  • for simplicity I've given just one "category" column and we're only looking at one value in it.
  • Cat Seq Valu D E F G H I J
    Eric ab This 1 1 1 0 1 0
    Eric bc This 2 2 2 0 2 0
    Eric cd That 3 1 1 2 1 2
    Eric de That 4 2 2 2 2 2
    Eric ef That 5 3 3 2 3 2
    Eric fg This 6 3 3 3 1 5
    Eric gh This 7 4 4 3 2 5
    Eric hi That 8 4 4 4 1 7
    Eric jk This 9 5 5 4 1 8
    Eric kl This 10 6 6 4 2 8
    Eric lm This 11 7 7 4 3 8
    About those columns:
    Column In source Meaning or derivation Calc
    Cat Y Category A one or more such columns
    Seq Y ImportantSequentialValue is often a timestamp
    Valu Y ExprValuThatMayChange
    D - CounterPerCategory row#
    E - CounterPerSubsetCategory for This see G
    F - CounterPerSubsetCategory for That see G
    G - CounterPerSubsetCategory for Either row#
    H - SubGrpNum D minus G
    I - ResetCounterPerCategorySubset row#
    J - ArbitraryChangeGroupId D minus I
    Notes
    You will see that the ArbitraryChangeGroupId values are not contiguous, but that for this purpose, that simply doesn't matter.
    Do note that in the SQL I've separated out all the steps so that each partial result is present as a named column. In practice, only some of those need to be made explicit, the others can just exist as expressions within the derivations.
    How it is best to do that will vary with the query engine being used and its approaches to optimisation. Some will just work around the named columns anyway, conversely some will become inefficient if unable to recognise that identical uses of ROW_NUMBER are happening. As the saying goes: your mileage may vary.
    In my work, I have an SQL script processing tool that lets me store the script as a pro forma and then have text substitutions made at run time. Thus I can affirm that the above SQL really works, as I used it via such substitutions for an actual run on real data.
    Background
    I originally wrote a version of this method a great many years ago. Of course, old hacks like me tend to say that kind of thing, and I probably do have the code somewhere in my archive.
    I can't quite remember whether I had access to ROW_NUMBER as a window function at the time, or had to derive something similar the hard way. I can say that the dialect I was using then didn't support CTE syntax, so I would have entirely used the derived table syntax.
    Recently the problem of doing this kind of thing came up when I was asked for help out with some analysis. While I knew I had a method somewhere in my past, I couldn't quite remember how it went.
    So, like anyone, I net searched and it was this posting that reminded me of the method.
    Resetting Row number according to record data change
    That only covers the reset counter aspect but it was enough to jog my memory for me to write out and test the full solution - so I'd like to give the link here in appreciation.
    Teradata Calendar substitution
    One of the test substitutions that I ran (see Notes section) was to apply the method to the Teradata System Calendar. Therefore this is only of use to those to those using a Teradata, but hey, I've no reason not to share this too. It was interesting coming up with a simple made-up category for the "change" values.
    WITH
    Round_1 AS (
        SELECT
            SrcTbl.* ,
            ( ( day_of_week + day_of_month + day_of_year + day_of_calendar + weekday_of_month ) MOD 2 ) AS SubsetMarker ,
            ( -- Derive CounterPerCategory
                ROW_NUMBER() OVER (
                    PARTITION BY
                        year_of_calendar ,
                        quarter_of_calendar
                    ORDER BY
                        calendar_date )
                ) AS CounterPerCategory ,
            ( -- Derive CounterPerSubsetCategory
                ROW_NUMBER() OVER (
                    PARTITION BY
                        year_of_calendar ,
                        quarter_of_calendar ,
                        SubsetMarker
                    ORDER BY
                        calendar_date )
                ) AS CounterPerSubsetCategory ,
            ( -- Derive SubGrpNum
                CounterPerCategory
                -
                CounterPerSubsetCategory
                ) AS SubGrpNum
        FROM
            sys_calendar.calendar AS SrcTbl
        ) ,
    Round_2 AS (
        SELECT
            Round_1.* ,
            ( -- Derive
                ROW_NUMBER() OVER (
                    PARTITION BY
                        year_of_calendar ,
                        quarter_of_calendar ,
                        SubGrpNum
                    ORDER BY
                        calendar_date )
                ) AS ResetCounterPerCategorySubset ,
            ( CounterPerSubsetCategory
                -
                ResetCounterPerCategorySubset
                ) AS ArbitraryChangeGroupId
        FROM
            Round_1
        )
    -- Example summary by the Arbitrary Change Groups SELECT
    SELECT
        year_of_calendar ,
        quarter_of_calendar ,
        ArbitraryChangeGroupId ,
        MIN( calendar_date  ) AS SubGrp_MinSeqVal ,
        MAX( calendar_date  ) AS SubGrp_MaxSeqVal ,
        COUNT(*) AS SubGrp_Rw_Cnt
    FROM
        Round_2
    GROUP BY
        year_of_calendar ,
        quarter_of_calendar ,
        ArbitraryChangeGroupId
    ORDER BY
        SubGrp_MinSeqVal
    ;

    26

    This website collects cookies to deliver better user experience

    SQL to derive subsequences