31
SQL for detecting contiguous and non-contiguous date spans
This is an example of using SQL to work out where there are gaps between date ranges in a table.
It assumes:
- that we have a source table of
DbsNm.TblNm
with columns for: - some categories - in this example there are three:
Category_A
,Category_B
&Category_C
but there could be one or more ; - a pair of columns giving a date span, in the form of "from" and "upto" -
Src_From_Dt
andSrc_Upto_Dt
Here, we'll assume a simple set of "from" and "upto" dates, each of which defines a valid period of some kind of state - e.g. "registered".
Here's an example for one category and two date span records:
Cat | From | Upto |
---|---|---|
Eric | 11 Feb 2007 | 27 Oct 2007 |
Eric | 28 Oct 2007 | 31 Dec 2009 |
As 28 Oct 2007
is the next day after 27 Oct 2007
these two records actually represent contiguous entries for Eric
Here's another date span example:
Cat | From | Upto |
---|---|---|
Erin | 01 Apr 2007 | 30 Sep 2007 |
Erin | 01 Apr 2008 | 30 Sep 2008 |
There, we have a distinct gap between the two records - to the effect that Erin
wasn't registered for a period of 2007-2008 and there are 2 non-contiguous registrations.
Once this kind of thing is possible, then we can expect to also see combinations of these occuring. For example:
Cat | From | Upto |
---|---|---|
Erni | 01 Jan 2000 | 31 Dec 2003 |
Erni | 01 Jan 2004 | 31 Dec 2006 |
Erni | 01 Jab 2009 | 31 Dec 2020 |
in which there are two distinct spans covered by those records.
So what we want to pull out of the data is:
- what are the sets of contiguous ranges and where are there gaps?
Note:
- This idea of a pair of span dates is quite common in "registrations" data, where each registration record applies for a span of inclusive dates. As a topic this can get quite complicated. See the end Notes section below for more about this.
While I'm coyly saying "category" here, most often the category value is a customer/client specific identifier - as it is usually their registrations being handled. But any concept where there are runs of dates that neatly follow each other is amenable to this method.
The SQL I'm giving here derives identifiers for the subsequences - as GrpNum
- and then uses that for an overall summary of them and their boundary points in a final SELECT and GROUP BY.
It also "builds in" some small allowances for overlooking small gaps in the date spans, such as may happen due to weekends and pubilc holidays.
Anyway, here's the SQL
WITH
D AS (
SELECT
MAX( Src_From_Dt ) OVER (
PARTITION BY
Category_A ,
Category_B ,
Category_C
ORDER BY
Src_Upto_Dt ,
Src_From_Dt
ROWS BETWEEN
1 PRECEDING
AND
1 PRECEDING
) AS Prev_From_Dt , -- not used, here for inspections
MAX( Src_Upto_Dt ) OVER (
PARTITION BY
Category_A ,
Category_B ,
Category_C
ORDER BY
Src_Upto_Dt ,
Src_From_Dt
ROWS BETWEEN
1 PRECEDING
AND
1 PRECEDING
) AS Prev_Upto_Dt ,
A.*
FROM
DbsNm.TblNm AS A -- AS SrcTbl
) ,
E AS (
SELECT
CASE
WHEN
Prev_Upto_Dt IS NULL
OR
( Src_From_Dt - Prev_Upto_Dt ) IN ( 0, 1, 2, 3 ) THEN 'C' -- Continuous
ELSE 'D' -- Discontinuous
END AS Continuity ,
( --
ROW_NUMBER() OVER (
PARTITION BY
Category_A ,
Category_B ,
Category_C
ORDER BY
Src_From_Dt )
-
ROW_NUMBER() OVER (
PARTITION BY
Category_A ,
Category_B ,
Category_C ,
Continuity
ORDER BY
Src_From_Dt )
) AS GrpNum ,
D.*
FROM
D
) ,
F AS (
SELECT
Category_A ,
Category_B ,
Category_C ,
GrpNum ,
MIN( Src_From_Dt ) AS CntntyGrp_From_Dt ,
MAX( Src_Upto_Dt ) AS CntntyGrp_Upto_Dt , -- not used, here for inspections
COUNT(*) AS Rw_Cnt
FROM
E
GROUP BY
Category_A ,
Category_B ,
Category_C ,
GrpNum
) ,
G AS (
SELECT
Category_A ,
Category_B ,
Category_C ,
COUNT( DISTINCT CntntyGrp_From_Dt ) AS CntntyGrp_From_Dt_Ctd ,
COUNT(*) AS Rw_Cnt
FROM
F
GROUP BY
Category_A ,
Category_B ,
Category_C
)
SELECT
CntntyGrp_From_Dt_Ctd ,
MIN( Rw_Cnt ) AS Min_RwsPerSince ,
MAX( Rw_Cnt ) AS Max_RwsPerSince ,
COUNT(*) AS CategoryCombo_Cnt ,
MIN( Category_A ) AS Min_Category_A ,
MAX( Category_A ) AS Max_Category_A
FROM
G
GROUP BY
CntntyGrp_From_Dt_Ctd
ORDER BY
CategoryCombo_Cnt DESC ,
Max_RwsPerSince DESC
;
I wasn't kidding about this! Among the issues that arise, and which depend on quite what data is being stored are:
- whether the dates are inclusive or not (some programmers/designers/architects get it into their head that non-inclusive dates are clever) ;
- NULL replacement dates - e.g. 31 Dec 9999 for the "upto" date ;
- overlapping spans left in the data ;
- not storing as span pairs but as starts/ends of spans in separate rows ;
- cancellations of records ;
- odd, partial or complex status concepts - e.g. submitted but not approved - do you treat that as "registered" or not? I've deliberately left that kind of thing out of this example.
Luckily, regardless of what one finds, it is usually possble to write a View that handles those complications and gives a simpler set such as presumed by this example.
- I could probably do a post just about handling all of these issues if people are interested.
Using multiple ROW_NUMBER calls tends to give data system admins the heebie-geebies but sensible use is usually ok.
When I ran this on a table of 470 million rows, it took 5 minutes, and returned 54 patterns. Those patterns ranged from approximately 400 million with just one run of contiguous date spans, up to single instances with as many as 96 contiguous date spans.
For this kind of "do we have a problem?" analysis I'd call that a fair investment.
Most likely the 400 million probably had just one date span record each and so didn't really need to be handled anyway. It can therefore be a good idea to restrict the "source table" feed to just those known to have multiple records. As ever with these things, your mileage may vary and you don't usually know when the extra coding was worthwhile until after you have your proof/disproof.
This SQL came about as I was asked to assist with some analysis and in passing I observed that it was being assumed that all the dated records were without any gaps. That led me to show that whether this was a valid assumption (or not) could be checked by a method I'd used long ago.
Having thus put myself on the spot I then tackled proving there was such a method and thereby tested the assumption in the analysis.
At first my memory struggled but an Internet search led me to the following posting that reminded me well enough of the method for me to reconstruct it in full.
Resetting Row number according to record data change
As that spared me from combing my archives, I'd like to give the link here in appreciation.
As it happens, the "old method" I'd recalled was more about finding state changes in date sequences - but I was able to adapt it to the date span problem at hand. With that done, I felt I'd better ensure I had the state change sequences idea rendered in modern CTE style and so took some time to write out and test it. With that coded I felt I could make it into a generic form - and is what I've shared at SQL to derive subsequences
Then, I figured I should do the same with the code for the date span issue - and thus was that SQL adapted and this post written.
31