I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan. The sample data posted
Here is a SQL only solution. I used DATETIME for the columns. Storing the time separate is a mistake in my opinion, as you will have problems when the times go past midnight. You can adjust this to handle that situation though if you need to. The solution also assumes that the start and end times are NOT NULL. Again, you can adjust as needed if that's not the case.
The general gist of the solution is to get all of the start times that don't overlap with any other spans, get all of the end times that don't overlap with any spans, then match the two together.
The results match your expected results except in one case, which checking by hand looks like you have a mistake in your expected output. On the 6th there should be a span that ends at 2009-06-06 10:18:45.000.
SELECT
ST.start_time,
ET.end_time
FROM
(
SELECT
T1.start_time
FROM
dbo.Test_Time_Spans T1
LEFT OUTER JOIN dbo.Test_Time_Spans T2 ON
T2.start_time < T1.start_time AND
T2.end_time >= T1.start_time
WHERE
T2.start_time IS NULL
) AS ST
INNER JOIN
(
SELECT
T3.end_time
FROM
dbo.Test_Time_Spans T3
LEFT OUTER JOIN dbo.Test_Time_Spans T4 ON
T4.end_time > T3.end_time AND
T4.start_time <= T3.end_time
WHERE
T4.start_time IS NULL
) AS ET ON
ET.end_time > ST.start_time
LEFT OUTER JOIN
(
SELECT
T5.end_time
FROM
dbo.Test_Time_Spans T5
LEFT OUTER JOIN dbo.Test_Time_Spans T6 ON
T6.end_time > T5.end_time AND
T6.start_time <= T5.end_time
WHERE
T6.start_time IS NULL
) AS ET2 ON
ET2.end_time > ST.start_time AND
ET2.end_time < ET.end_time
WHERE
ET2.end_time IS NULL
In MySQL
:
SELECT grouper, MIN(start) AS group_start, MAX(end) AS group_end
FROM (
SELECT start,
end,
@r := @r + (@edate < start) AS grouper,
@edate := GREATEST(end, CAST(@edate AS DATETIME))
FROM (
SELECT @r := 0,
@edate := CAST('0000-01-01' AS DATETIME)
) vars,
(
SELECT rn_date + INTERVAL TIME_TO_SEC(rn_start) SECOND AS start,
rn_date + INTERVAL TIME_TO_SEC(rn_end) SECOND + INTERVAL (rn_start > rn_end) DAY AS end
FROM t_ranges
) q
ORDER BY
start
) q
GROUP BY
grouper
ORDER BY
group_start
Same decision for SQL Server
is described in the following article in my blog:
Here's the function to do this:
DROP FUNCTION fn_spans
GO
CREATE FUNCTION fn_spans(@p_from DATETIME, @p_till DATETIME)
RETURNS @t TABLE
(
q_start DATETIME NOT NULL,
q_end DATETIME NOT NULL
)
AS
BEGIN
DECLARE @qs DATETIME
DECLARE @qe DATETIME
DECLARE @ms DATETIME
DECLARE @me DATETIME
DECLARE cr_span CURSOR FAST_FORWARD
FOR
SELECT s_date + s_start AS q_start,
s_date + s_stop + CASE WHEN s_start < s_stop THEN 0 ELSE 1 END AS q_end
FROM t_span
WHERE s_date BETWEEN @p_from - 1 AND @p_till
AND s_date + s_start >= @p_from
AND s_date + s_stop <= @p_till
ORDER BY
q_start
OPEN cr_span
FETCH NEXT
FROM cr_span
INTO @qs, @qe
SET @ms = @qs
SET @me = @qe
WHILE @@FETCH_STATUS = 0
BEGIN
FETCH NEXT
FROM cr_span
INTO @qs, @qe
IF @qs > @me
BEGIN
INSERT
INTO @t
VALUES (@ms, @me)
SET @ms = @qs
END
SET @me = CASE WHEN @qe > @me THEN @qe ELSE @me END
END
IF @ms IS NOT NULL
BEGIN
INSERT
INTO @t
VALUES (@ms, @me)
END
CLOSE cr_span
RETURN
END
Since SQL Server
lacks an easy way to refer to previously selected rows in a resultset, this is one of rare cases when cursors in SQL Server
work faster than set-based decisions.
Tested on 1,440,000
rows, works for 24
seconds for the full set, and almost instant for a range of day or two.
Note the additional condition in the SELECT
query:
s_date BETWEEN @p_from - 1 AND @p_till
This seems to be redundant, but it is actually a coarse filter to make your index on s_date
usable.
Assuming you:
Do the following:
first = first row in L
flat_date.start = first.start, flat_date.end = first.end
For each row in L:
if row.start < flat_date.end and row.end > flat_date.end: // adding on to a timespan
flat_date.end = row.end
else: // ending a timespan and starting a new one
add flat_date to F
flat_date.start = row.start, flat_date.end = row.end
add flat_date to F // adding the last timespan to the flattened list
Extending on MahlerFive answer I wrote a swift extension to DateTools. So far it has passed all my tests.
extension DTTimePeriodCollection {
func flatten() {
self.sortByStartAscending()
guard let periods = self.periods() else { return }
if periods.count < 1 { return }
var flattenedPeriods = [DTTimePeriod]()
let flatdate = DTTimePeriod()
for period in periods {
guard let periodStart = period.StartDate, let periodEnd = period.EndDate else { continue }
if !flatdate.hasStartDate() { flatdate.StartDate = periodStart }
if !flatdate.hasEndDate() { flatdate.EndDate = periodEnd }
if periodStart.isEarlierThanOrEqualTo(flatdate.EndDate) && periodEnd.isGreaterThanOrEqualTo(flatdate.EndDate) {
flatdate.EndDate = periodEnd
} else {
flattenedPeriods.append(flatdate.copy())
flatdate.StartDate = periodStart
flatdate.EndDate = periodEnd
}
}
flattenedPeriods.append(flatdate.copy())
// delete all periods
for var i = 0 ; i < periods.count ; i++ { self.removeTimePeriodAtIndex(0) }
// add flattened periods to self
for flat in flattenedPeriods { self.addTimePeriod(flat) }
}
Here is a recursive CTE solution, but I took the liberty of assigning a date and time to each column rather than pulling the date out separately. Helps to avoid some messy special case code. If you must store the date separately, I would use a view of CTE to make it look like two datetime columns and go with this approach.
create test data:
create table t1 (d1 datetime, d2 datetime)
insert t1 (d1,d2)
select '2009-06-03 10:00:00', '2009-06-03 14:00:00'
union all select '2009-06-03 13:55:00', '2009-06-03 18:00:00'
union all select '2009-06-03 17:55:00', '2009-06-03 23:00:00'
union all select '2009-06-03 22:55:00', '2009-06-04 03:00:00'
union all select '2009-06-04 03:05:00', '2009-06-04 07:00:00'
union all select '2009-06-04 07:05:00', '2009-06-04 10:00:00'
union all select '2009-06-04 09:55:00', '2009-06-04 14:00:00'
Recursive CTE:
;with dateRanges (ancestorD1, parentD1, d2, iter) as
(
--anchor is first level of collapse
select
d1 as ancestorD1,
d1 as parentD1,
d2,
cast(0 as int) as iter
from t1
--recurse as long as there is another range to fold in
union all select
tLeft.ancestorD1,
tRight.d1 as parentD1,
tRight.d2,
iter + 1 as iter
from dateRanges as tLeft join t1 as tRight
--join condition is that the t1 row can be consumed by the recursive row
on tLeft.d2 between tRight.d1 and tRight.d2
--exclude identical rows
and not (tLeft.parentD1 = tRight.d1 and tLeft.d2 = tRight.d2)
)
select
ranges1.*
from dateRanges as ranges1
where not exists (
select 1
from dateRanges as ranges2
where ranges1.ancestorD1 between ranges2.ancestorD1 and ranges2.d2
and ranges1.d2 between ranges2.ancestorD1 and ranges2.d2
and ranges2.iter > ranges1.iter
)
Gives output:
ancestorD1 parentD1 d2 iter
----------------------- ----------------------- ----------------------- -----------
2009-06-04 03:05:00.000 2009-06-04 03:05:00.000 2009-06-04 07:00:00.000 0
2009-06-04 07:05:00.000 2009-06-04 09:55:00.000 2009-06-04 14:00:00.000 1
2009-06-03 10:00:00.000 2009-06-03 22:55:00.000 2009-06-04 03:00:00.000 3
Similar question on SO here:
Min effective and termdate for contiguous dates
FWIW I up-voted the one that recommended Joe Celko's SQL For Smarties, Third Edition -- repeat: Third Edition (2005) -- which discusses various approaches, set base and procedural.