I have a table with 2 fields: unique ID, user ID (foreign key) and date-time. This is an access-log to a service. I work in SQL Server but I would appreciate agnostic answer
Database-agnostic, something of a variant of richardtallent's, but without the restrictions. (I'm using SQL Server 2008 here, but it shouldn't matter.)
Starting with this setup:
create table test(id int, userid int, time datetime)
insert into test values (1, 1, '2009-03-11 08:00')
insert into test values (2, 1, '2009-03-11 18:00')
insert into test values (3, 1, '2009-03-13 19:00')
insert into test values (4, 1, '2009-03-14 18:00')
Running this query:
select
starttime.id as gapid, starttime.time as starttime, endtime.time as endtime,
/* Replace next line with your DB's way of calculating the gap */
DATEDIFF(second, starttime.time, endtime.time) as gap
from
test as starttime
inner join test as endtime on
(starttime.userid = endtime.userid)
and (starttime.time < endtime.time)
left join test as intermediatetime on
(starttime.userid = intermediatetime.userid)
and (starttime.time < intermediatetime.time)
and (intermediatetime.time < endtime.time)
where
(intermediatetime.id is null)
Gives the following:
gapid starttime endtime gap
1 2009-03-11 08:00:00.000 2009-03-11 18:00:00.000 36000
2 2009-03-11 18:00:00.000 2009-03-13 19:00:00.000 176400
3 2009-03-13 19:00:00.000 2009-03-14 18:00:00.000 82800
You can then just ORDER BY
the gap expression descending, and pick the top result.
Some explanation:
WHERE
clause, though, filters these out (keeps only the rows with no intermediate row), hence keeping only {1+2+null, 2+3+null, 3+4+null}. Taa-daa!If you could, potentially, have the same time in there twice (a 'gap' of 0) then you'll need a way to break ties, as Dems points out. If you can use ID as a tie-breaker, then change e.g.
and (starttime.time < intermediatetime.time)
to
and ((starttime.time < intermediatetime.time)
or ((starttime.time = intermediatetime.time) and (starttime.id < intermediatetime.id)))
assuming that 'id' is a valid way to break ties.
In fact, if you know that ID will be monotonically increasing (I know you said 'not sequential,' but it's not clear if this means that they don't increase with each row, or just that the IDs of the two relevant entries may not be sequential because e.g. another user has entries in between), you can use ID instead of time in all the comparisons to make this even simpler.
First, join the table to itself so each record for a given user is paired with any record for that same user.
Then, select only those pairs where the first is before the last, there is no record before the first one, and no record after the last one.
SELECT t1.id, t1.[user-id], t1.time, (t2.time - t1.time) AS GapTime
FROM
t AS t1
INNER JOIN t AS t2 ON t1.[user-id] = t2.[user-id]
WHERE
t1.time < t2.time
AND NOT EXISTS (SELECT NULL FROM t AS t3 WHERE t3.[user-id] = t1.[user-id]
AND t3.time > t2.time)
AND NOT EXISTS (SELECT NULL FROM t AS t4 WHERE t4.[user-id] = t1.[user-id]
AND t4.time < t1.time)
Caveats:
If desired, you can fix #2 above by changing "t1.time < t2.time" to "t1.time <= t2.time", which will give you a gap of 0 if there is only one record for the user.
Join ranked Time on one-off rank to get the gap:
with cte_ranked as (
select *, row_number() over (partition by UserId order by Time) as rn
from table)
select l.*, datediff(minute, r.Time, l.Time) as gap_length
from cte_ranked l join cte_ranked r on l.UserId = r.UserId and l.rn = r.rn-1
You can then use many methods to identify the maximum gap, when it started etc.
Update
My original answer was written from a Mac w/o a database to test with. I had some more time to play with this problem and actually test and measure how it performs on a 1M records table. My test table is defined like this:
create table access (id int identity(1,1)
, UserId int not null
, Time datetime not null);
create clustered index cdx_access on access(UserID, Time);
go
For selecting the record for any information, my preferred answer so far is this:
with cte_gap as (
select Id, UserId, a.Time, (a.Time - prev.Time) as gap
from access a
cross apply (
select top(1) Time
from access b
where a.UserId = b.UserId
and a.Time > b.Time
order by Time desc) as prev)
, cte_max_gap as (
select UserId, max(gap) as max_gap
from cte_gap
group by UserId)
select g.*
from cte_gap g
join cte_max_gap m on m.UserId = g.UserId and m.max_gap = g.gap
where g.UserId = 42;
From 1M record, ~47k distinct users, the result for this is returned in 1ms on my test puny instance (warm cache), 48 page reads.
If the UserId=42 filter is removed the max gap and time it occurred for every user (with duplicates for multiple max gaps) need 6379139 reads, quite heavy, and takes 14s on my test machine.
The time can be cut in half if only the UserId and max gap is needed (no info when the max gap occurred):
select UserId, max(a.Time-prev.Time) as gap
from access a
cross apply (
select top(1) Time
from access b
where a.UserId = b.UserId
and a.Time > b.Time
order by Time desc
) as prev
group by UserId
This only needs 3193448 reads, only half compared to previous, and completed in 6 seconds on 1M records. The difference occurs because the previous version needed to evaluate every gap once to find the max one, then evaluate them again to find the ones that equal with the max. Note that for this performance results the structure of the table I proposed with an index on (UserId, Time) is critical.
As for the use of CTEs and 'partitions' (better known as ranking functions): this is all ANSI SQL-99 and is supported by most vendors. The only SQL Server specific construct was the use of the datediff
function, which is now removed. I have a feeling some readers understand 'agnostic' as 'least common denominator SQL understood also by my favorite vendor'. Also note that the use of common table expressions and cross apply operator are used solely to improve the readability of the query. Both can be replaced with derived table using a simple, mechanical, replacement. Here is the very same query where the CTEs where replaced with derived tables. I'll let you judge yourselves on its readability compared with the CTE based one:
select g.*
from (
select Id, UserId, a.Time, (a.Time - (
select top(1) Time
from access b
where a.UserId = b.UserId
and a.Time > b.Time
order by Time desc
)) as gap
from access a) as g
join (
select UserId, max(gap) as max_gap
from (
select Id, UserId, a.Time, (a.Time - (
select top(1) Time
from access b
where a.UserId = b.UserId
and a.Time > b.Time
order by Time desc
)) as gap
from access a) as cte_gap
group by UserId) as m on m.UserId = g.UserId and m.max_gap = g.gap
where g.UserId = 42
Damn, I was hopping will end up more convoluted lol. This is quite readable because it had only two CTEs to start from. Still, on queries with 5-6 derived tables, the CTE form is way, way more readable.
For completeness, here is the same transformation applied to my simplified query (only max gaps, no gap end time and access id):
select UserId, max(gap)
from (
select UserId, a.Time-(
select top(1) Time
from access b
where a.UserId = b.UserId
and a.Time > b.Time
order by Time desc) as gap
from access a) as gaps
group by UserId
Very similar to RichardTallent's answer...
SELECT
t1.id,
t1.[user-id],
t1.time,
DATEDIFF(s, t1.time, t2.time) AS GapTime
FROM
t AS t1
INNER JOIN
t AS t2
ON t2.[user-id] = t1.[user-id]
AND t2.time = (
SELECT
MIN(time)
FROM
t
WHERE
[user-id] = t1.[user-id]
AND time > t1.time
)
AS you are only actually using the time value from t2, you can actually re-organise as follows to deal with users with just one entry...
SELECT
t1.id,
t1.[user-id],
t1.time,
DATEDIFF(
s,
t1.time,
(
SELECT
MIN(time)
FROM
t
WHERE
[user-id] = t1.[user-id]
AND time > t1.time
)
) AS GapTime
FROM
t1
Finally, there is the possiblity of multiple entries with the same time stamp. When that happens we need additional info to decide the order allowing us to determine which record is 'next'.
Where there are several entries with the same time stamp, all bar one will have a GapTime of 0:
- '12:00' (Gap of 1 until next entry)
- '12:01' (Gap of 0 until next entry)
- '12:01' (Gap of 0 until next entry)
- '12:01' (Gap of 0 until next entry)
- '12:01' (Gap of 1 until next entry)
- '12:02' (Gap of NULL until next entry)
Only the one which is 'last' will have a non-zero time stamp. Although the question states that the "id" may not be in order, it is the only info we have for to determine which reocrd is 'last' when the timestamps are the same.
SELECT
t1.id,
t1.[user-id],
t1.time,
DATEDIFF(
s,
t1.time,
(
SELECT
MIN(time)
FROM
t
WHERE
[user-id] = t1.[user-id]
AND
(
(time > t1.time)
OR
(time = t1.time AND id > t1.id)
)
)
) AS GapTime
FROM
t1