SQL: finding longest date gap

面向向阳花 2020-12-11 04:34

I have a table with 3 fields: unique ID, user ID (foreign key) and date-time. It is an access log for a service. I work in SQL Server, but I would appreciate a database-agnostic answer.

4 answers
  • 2020-12-11 04:54

    Database-agnostic, something of a variant of richardtallent's, but without the restrictions. (I'm using SQL Server 2008 here, but it shouldn't matter.)

    Starting with this setup:

    create table test(id int, userid int, time datetime)
    insert into test values (1, 1, '2009-03-11 08:00')
    insert into test values (2, 1, '2009-03-11 18:00')
    insert into test values (3, 1, '2009-03-13 19:00')
    insert into test values (4, 1, '2009-03-14 18:00')
    

    Running this query:

    select 
      starttime.id as gapid, starttime.time as starttime, endtime.time as endtime, 
      /* Replace next line with your DB's way of calculating the gap */
      DATEDIFF(second, starttime.time, endtime.time) as gap
    from 
      test as starttime
    inner join test as endtime on 
      (starttime.userid = endtime.userid) 
      and (starttime.time < endtime.time) 
    left join test as intermediatetime on 
      (starttime.userid = intermediatetime.userid) 
      and (starttime.time < intermediatetime.time) 
      and (intermediatetime.time < endtime.time) 
    where 
      (intermediatetime.id is null)
    

    Gives the following:

    gapid  starttime                endtime                  gap
    1      2009-03-11 08:00:00.000  2009-03-11 18:00:00.000  36000
    2      2009-03-11 18:00:00.000  2009-03-13 19:00:00.000  176400
    3      2009-03-13 19:00:00.000  2009-03-14 18:00:00.000  82800
    

    You can then just ORDER BY the gap expression descending, and pick the top result.

    Some explanation:

    • Like richardtallent's answer, you join the table onto itself to find a 'later' record – this basically pairs all records with ANY of their later records, here pairing {1+2, 1+3, 1+4, 2+3, 2+4, 3+4}.
    • Then there's another self-join, this time a left join, to find rows in between the two previously selected so {1+2+null, 1+3+2, 1+4+2, 1+4+3, 2+3+null, 2+4+3, 3+4+null}.
    • The WHERE clause, though, filters these out (keeps only the rows with no intermediate row), hence keeping only {1+2+null, 2+3+null, 3+4+null}. Taa-daa!
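
    To try the joins above end to end without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. It assumes timestamps stored as text and swaps DATEDIFF for SQLite's strftime('%s', ...); the short aliases s, e and m stand in for starttime, endtime and intermediatetime.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table test(id int, userid int, time text);
    insert into test values (1, 1, '2009-03-11 08:00');
    insert into test values (2, 1, '2009-03-11 18:00');
    insert into test values (3, 1, '2009-03-13 19:00');
    insert into test values (4, 1, '2009-03-14 18:00');
""")

# Pair each row with every later row (s, e), then keep only the pairs
# that have no row m strictly in between.
gaps = conn.execute("""
    select s.id, s.time, e.time,
           cast(strftime('%s', e.time) as integer)
             - cast(strftime('%s', s.time) as integer) as gap
    from test as s
    inner join test as e
        on s.userid = e.userid and s.time < e.time
    left join test as m
        on s.userid = m.userid and s.time < m.time and m.time < e.time
    where m.id is null
    order by gap desc
""").fetchall()

longest = gaps[0]  # first row after ORDER BY gap DESC
print(longest)     # (2, '2009-03-11 18:00', '2009-03-13 19:00', 176400)
```

    Sorting in the query and taking the first row plays the role of TOP 1 here; with several users you would still need a per-user maximum.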

    If the same time could appear twice (a 'gap' of 0), you'll need a way to break ties, as Dems points out. If you can use ID as a tie-breaker, then change each strict time comparison in the joins, e.g.

    and (starttime.time < intermediatetime.time) 
    

    to

    and ((starttime.time < intermediatetime.time) 
      or ((starttime.time = intermediatetime.time) and (starttime.id < intermediatetime.id)))
    

    assuming that 'id' is a valid way to break ties.

    In fact, if you know that ID will be monotonically increasing (I know you said 'not sequential,' but it's not clear if this means that they don't increase with each row, or just that the IDs of the two relevant entries may not be sequential because e.g. another user has entries in between), you can use ID instead of time in all the comparisons to make this even simpler.
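
    As a sketch of the tie-breaking idea, the following runs the same three-way join in SQLite through Python's sqlite3 module, with every "earlier than" test replaced by a comparison on (time, id). The helper before() is hypothetical and only builds the SQL fragment; the data is made up, with two rows sharing 12:01.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table test(id int, userid int, time text);
    insert into test values (1, 1, '2009-03-11 12:00');
    insert into test values (2, 1, '2009-03-11 12:01');
    insert into test values (3, 1, '2009-03-11 12:01');
    insert into test values (4, 1, '2009-03-11 12:02');
""")

def before(a, b):
    # "a comes before b": earlier time, or same time and smaller id.
    return f"({a}.time < {b}.time or ({a}.time = {b}.time and {a}.id < {b}.id))"

query = f"""
    select s.id, e.id,
           cast(strftime('%s', e.time) as integer)
             - cast(strftime('%s', s.time) as integer) as gap
    from test as s
    inner join test as e
        on s.userid = e.userid and {before('s', 'e')}
    left join test as m
        on s.userid = m.userid and {before('s', 'm')} and {before('m', 'e')}
    where m.id is null
    order by s.time, s.id
"""
pairs = conn.execute(query).fetchall()
print(pairs)  # [(1, 2, 60), (2, 3, 0), (3, 4, 60)]
```

    Each row is paired with exactly one successor, and the duplicate timestamp shows up as a single gap of 0.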

  • 2020-12-11 05:02

    First, join the table to itself so each record for a given user is paired with any record for that same user.

    Then, select only those pairs where the first is before the last, there is no record before the first one, and no record after the last one.

     SELECT t1.id, t1.[user-id], t1.time, (t2.time - t1.time) AS GapTime
     FROM
         t AS t1
         INNER JOIN t AS t2 ON t1.[user-id] = t2.[user-id]
     WHERE
         t1.time < t2.time
         AND NOT EXISTS (SELECT NULL FROM t AS t3 WHERE t3.[user-id] = t1.[user-id]
             AND t3.time > t2.time)
         AND NOT EXISTS (SELECT NULL FROM t AS t4 WHERE t4.[user-id] = t1.[user-id]
             AND t4.time < t1.time)
    

    Caveats:

    1. Does not return users that have 0 or 1 records.
    2. Does not return users where all records have the same date/time.
    3. Will return multiple records for a user if the user has duplicate records on the starting or ending boundary of their largest gap.

    If desired, you can fix #2 above by changing "t1.time < t2.time" to "t1.time <= t2.time", which will give you a gap of 0 if there is only one record for the user.

  • 2020-12-11 05:03

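    Join the rows, ranked by Time, to their off-by-one neighbours to get the gap:

    with cte_ranked as (
    select *, row_number() over (partition by UserId order by Time) as rn
    from access) -- 'access' stands for your table name
    -- l is the earlier row and r the next one, so the gap runs from l.Time to r.Time
    select l.*, datediff(minute, l.Time, r.Time) as gap_length
    from cte_ranked l join cte_ranked r on l.UserId = r.UserId and l.rn = r.rn-1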
    

    You can then use many methods to identify the maximum gap, when it started etc.
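
    One such method, sketched with Python's sqlite3 module: rank the rows, pair each with its successor, then keep the gaps that equal the per-user maximum. This assumes a SQLite build with window functions (3.25 or later), text timestamps, and made-up sample data; cte_gap and gap_minutes are names chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table access(id integer primary key, UserId int not null, Time text not null);
    insert into access(UserId, Time) values
        (1, '2009-03-11 08:00'), (1, '2009-03-11 18:00'),
        (1, '2009-03-13 19:00'), (1, '2009-03-14 18:00'),
        (2, '2009-03-12 09:00'), (2, '2009-03-12 09:30');
""")

rows = conn.execute("""
    with cte_ranked as (
        select *, row_number() over (partition by UserId order by Time) as rn
        from access),
    cte_gap as (
        -- l is the earlier row, r the row ranked right after it
        select l.UserId, l.Time as gap_start,
               (cast(strftime('%s', r.Time) as integer)
                 - cast(strftime('%s', l.Time) as integer)) / 60 as gap_minutes
        from cte_ranked l
        join cte_ranked r on l.UserId = r.UserId and l.rn = r.rn - 1)
    select UserId, gap_start, gap_minutes
    from cte_gap g
    where gap_minutes = (select max(gap_minutes) from cte_gap m
                         where m.UserId = g.UserId)
    order by UserId
""").fetchall()
print(rows)  # [(1, '2009-03-11 18:00', 2940), (2, '2009-03-12 09:00', 30)]
```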

    Update

    My original answer was written from a Mac without a database to test with. I have since had more time to play with this problem, and to actually test and measure how it performs on a table with 1M records. My test table is defined like this:

    create table access (id int identity(1,1)
        , UserId int not null
        , Time datetime not null);
    create clustered index cdx_access on access(UserID, Time);
    go
    

    For selecting the max-gap record together with the rest of its information (such as when the gap occurred), my preferred answer so far is this:

    with cte_gap as (
        select Id, UserId, a.Time, (a.Time - prev.Time) as gap
        from access a
        cross apply (
            select top(1) Time 
            from access b
            where a.UserId = b.UserId
                and a.Time > b.Time
            order by Time desc) as prev)
    , cte_max_gap as (
        select UserId, max(gap) as max_gap
        from cte_gap
        group by UserId)
    select g.* 
        from cte_gap g
        join cte_max_gap m on m.UserId = g.UserId and m.max_gap = g.gap
    where g.UserId = 42;
    

    With 1M records and ~47k distinct users, the result is returned in 1ms on my puny test instance (warm cache), with 48 page reads.

    If the UserId=42 filter is removed, getting the max gap and the time it occurred for every user (with duplicates for multiple max gaps) needs 6379139 reads, which is quite heavy, and takes 14s on my test machine.

    The time can be cut in half if only the UserId and max gap are needed (no info on when the max gap occurred):

    select UserId, max(a.Time-prev.Time) as gap
        from access a
        cross apply (
            select top(1) Time 
            from access b
            where a.UserId = b.UserId
                and a.Time > b.Time
            order by Time desc
        ) as prev
    group by UserId
    

    This needs only 3193448 reads, half as many as the previous query, and completes in 6 seconds on 1M records. The difference occurs because the previous version had to evaluate every gap once to find the maximum, then evaluate them all again to find the ones equal to the maximum. Note that for these performance results the table structure I proposed, with a clustered index on (UserId, Time), is critical.

    As for the use of CTEs and 'partitions' (better known as ranking functions): this is all ANSI SQL-99 and is supported by most vendors. The only SQL Server specific construct was the use of the datediff function, which is now removed. I have a feeling some readers understand 'agnostic' as 'least common denominator SQL understood also by my favorite vendor'. Also note that the common table expressions and the cross apply operator are used solely to improve the readability of the query. Both can be replaced with derived tables using a simple, mechanical replacement. Here is the very same query where the CTEs were replaced with derived tables. I'll let you judge for yourselves how its readability compares with the CTE-based one:

    select g.*
        from (    
            select Id, UserId, a.Time, (a.Time - (
                select top(1) Time 
                from access b
                where a.UserId = b.UserId
                    and a.Time > b.Time
                order by Time desc
            )) as gap
            from access a) as g
        join (
            select UserId, max(gap) as max_gap
                from (
                    select Id, UserId, a.Time, (a.Time - (
                       select top(1) Time 
                       from access b
                       where a.UserId = b.UserId
                         and a.Time > b.Time
                       order by Time desc
                       )) as gap
                from access a) as cte_gap
            group by UserId) as m on m.UserId = g.UserId and m.max_gap = g.gap
        where g.UserId = 42
    

    Damn, I was hoping it would end up more convoluted, lol. This is quite readable because it had only two CTEs to start from. Still, on queries with 5-6 derived tables, the CTE form is way, way more readable.

    For completeness, here is the same transformation applied to my simplified query (only max gaps, no gap end time and access id):

    select UserId, max(gap)
        from (
            select UserId, a.Time-(
                select top(1) Time 
                from access b
                where a.UserId = b.UserId
                    and a.Time > b.Time
                order by Time desc) as gap
        from access a) as gaps
    group by UserId
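
    For readers without SQL Server, here is a rough sketch of the same derived-table query in SQLite, run through Python's sqlite3 module. It assumes text timestamps, uses LIMIT 1 in place of top(1), and uses strftime('%s', ...) for the subtraction; the sample data and the index name idx_access are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table access(id integer primary key, UserId int not null, Time text not null);
    create index idx_access on access(UserId, Time);
    insert into access(UserId, Time) values
        (1, '2009-03-11 08:00'), (1, '2009-03-11 18:00'),
        (1, '2009-03-13 19:00'), (1, '2009-03-14 18:00'),
        (2, '2009-03-12 09:00'), (2, '2009-03-12 09:30');
""")

rows = conn.execute("""
    select UserId, max(gap) as max_gap_seconds
    from (
        select a.UserId,
               cast(strftime('%s', a.Time) as integer)
                 - cast(strftime('%s', (
                       -- latest row of the same user before a.Time
                       select b.Time
                       from access b
                       where a.UserId = b.UserId and a.Time > b.Time
                       order by b.Time desc
                       limit 1)) as integer) as gap
        from access a) as gaps
    group by UserId
    order by UserId
""").fetchall()
print(rows)  # [(1, 176400), (2, 1800)]
```

    The first row of each user has no predecessor, so its gap is NULL and max() skips it. Unlike the cross apply form, a user with a single row would still appear here, with a NULL maximum.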
    
  • 2020-12-11 05:15

    Very similar to RichardTallent's answer...

    SELECT
       t1.id,
       t1.[user-id],
       t1.time,
       DATEDIFF(s, t1.time, t2.time) AS GapTime
    FROM
       t AS t1
    INNER JOIN
       t AS t2
          ON  t2.[user-id] = t1.[user-id]
          AND t2.time = (
             SELECT
                MIN(time)
             FROM
                t
             WHERE
                [user-id] = t1.[user-id]
                AND time > t1.time
          )
    


    As you are only actually using the time value from t2, you can re-organise as follows to deal with users with just one entry...

    SELECT
       t1.id,
       t1.[user-id],
       t1.time,
       DATEDIFF(
          s,
          t1.time,
          (
             SELECT
                MIN(time)
             FROM
                t
             WHERE
                [user-id] = t1.[user-id]
                AND time > t1.time
          )
       ) AS GapTime
    FROM
       t AS t1
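
    A small sketch of this scalar-subquery form in SQLite (through Python's sqlite3 module) shows the effect: the last entry of each user, including a user with a single entry, comes back with a NULL gap instead of disappearing. It assumes text timestamps, a column named user_id in place of [user-id], and strftime('%s', ...) in place of DATEDIFF.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t(id int, user_id int, time text);
    insert into t values (1, 1, '2009-03-11 08:00');
    insert into t values (2, 1, '2009-03-11 18:00');
    insert into t values (3, 2, '2009-03-12 09:00');  -- user 2 has one entry
""")

rows = conn.execute("""
    select t1.id, t1.user_id, t1.time,
           cast(strftime('%s', (
               -- earliest later entry of the same user, NULL if none
               select min(time) from t
               where user_id = t1.user_id and time > t1.time)) as integer)
             - cast(strftime('%s', t1.time) as integer) as GapTime
    from t as t1
    order by t1.id
""").fetchall()

for row in rows:
    print(row)
```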
    


    Finally, there is the possibility of multiple entries with the same time stamp. When that happens we need additional info to decide the order, allowing us to determine which record is 'next'.

    Where there are several entries with the same time stamp, all bar one will have a GapTime of 0:
    - '12:00' (Gap of 1 until next entry)
    - '12:01' (Gap of 0 until next entry)
    - '12:01' (Gap of 0 until next entry)
    - '12:01' (Gap of 0 until next entry)
    - '12:01' (Gap of 1 until next entry)
    - '12:02' (Gap of NULL until next entry)

    Only the one which is 'last' will have a non-zero gap. Although the question states that the "id" may not be in order, it is the only info we have to determine which record is 'last' when the timestamps are the same.

    SELECT
       t1.id,
       t1.[user-id],
       t1.time,
       DATEDIFF(
          s,
          t1.time,
          (
             SELECT
                MIN(time)
             FROM
                t
             WHERE
                [user-id] = t1.[user-id]
                AND
                (
                   (time > t1.time)
                   OR
                   (time = t1.time AND id > t1.id)
                )
          )
       ) AS GapTime
    FROM
       t AS t1
    