SQL to determine minimum sequential days of access?

我在风中等你 2020-12-04 04:58

The following User History table contains one record for every day a given user has accessed a website (in a 24 hour UTC period). It has many thousands of r

19 answers
  • 2020-12-04 05:28

    This should do what you want, but I don't have enough data to test its efficiency. The convoluted CONVERT/FLOOR expressions strip the time portion off the datetime field. If you're using SQL Server 2008 or later, you could use CAST(x.CreationDate AS DATE) instead.

    DECLARE @Range as INT
    SET @Range = 10
    
    SELECT DISTINCT UserId, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))
      FROM tblUserLogin a
    WHERE EXISTS
       (SELECT 1 
          FROM tblUserLogin b 
         WHERE a.userId = b.userId 
           AND (SELECT COUNT(DISTINCT(CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, CreationDate))))) 
                  FROM tblUserLogin c 
                 WHERE c.userid = b.userid 
                   AND CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, c.CreationDate))) BETWEEN CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate))) and CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))+@Range-1) = @Range)
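
To make the logic of the query above easier to follow, here is a minimal Python sketch of the same idea (not the SQL itself): strip the time portion from each timestamp, then report every (user, start day) for which all days in the window are present. The function name `find_streak_starts` and the sample data are hypothetical.

```python
from datetime import date, datetime, timedelta


def find_streak_starts(logins, range_days):
    """Return sorted (user_id, start_day) pairs that begin a run of
    `range_days` consecutive calendar days for that user."""
    # Strip the time portion and drop duplicates, like DISTINCT on the floored date.
    days_by_user = {}
    for user_id, ts in logins:
        days_by_user.setdefault(user_id, set()).add(ts.date())

    starts = []
    for user_id, days in days_by_user.items():
        for start in days:
            window = {start + timedelta(days=i) for i in range(range_days)}
            # Same test as COUNT(DISTINCT day) = @Range over the BETWEEN window.
            if window <= days:
                starts.append((user_id, start))
    return sorted(starts)


# Hypothetical sample data: user 1 visits three days in a row, user 2 skips a day.
logins = [
    (1, datetime(2009, 7, 1, 9, 15)),
    (1, datetime(2009, 7, 2, 23, 59)),
    (1, datetime(2009, 7, 3, 0, 1)),
    (1, datetime(2009, 7, 3, 18, 0)),  # second visit on the same day
    (2, datetime(2009, 7, 1, 12, 0)),
    (2, datetime(2009, 7, 3, 12, 0)),  # gap on 2009-07-02
]
result = find_streak_starts(logins, 3)
expected = [(1, date(2009, 7, 1))]
print(result == expected)
```

Like the SQL, this returns one row per qualifying start day, so a user with a longer streak appears several times.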
    

    Creation script

    CREATE TABLE [dbo].[tblUserLogin](
        [Id] [int] IDENTITY(1,1) NOT NULL,
        [UserId] [int] NULL,
        [CreationDate] [datetime] NULL
    ) ON [PRIMARY]
    
  • 2020-12-04 05:29

    The answer is obviously:

    SELECT DISTINCT UserId
    FROM UserHistory uh1
    WHERE (
           SELECT COUNT(*) 
           FROM UserHistory uh2 
           WHERE uh2.CreationDate 
           BETWEEN uh1.CreationDate AND DATEADD(d, @days, uh1.CreationDate)
          ) = @days OR UserId = 52551
    

    EDIT:

    Okay here's my serious answer:

    DECLARE @days int
    DECLARE @seconds bigint
    SET @days = 30
    SET @seconds = (@days * 24 * 60 * 60) - 1
    SELECT DISTINCT UserId
    FROM (
        SELECT uh1.UserId, Count(uh1.Id) as Conseq
        FROM UserHistory uh1
        INNER JOIN UserHistory uh2 ON uh2.CreationDate 
            BETWEEN uh1.CreationDate AND 
                DATEADD(s, @seconds, DATEADD(dd, DATEDIFF(dd, 0, uh1.CreationDate), 0))
            AND uh1.UserId = uh2.UserId
        GROUP BY uh1.Id, uh1.UserId
        ) as Tbl
    WHERE Conseq >= @days
    

    EDIT:

    [Jeff Atwood] This is a great fast solution and deserves to be accepted, but Rob Farley's solution is also excellent and arguably even faster (!). Please check it out too!

  • 2020-12-04 05:29

    How about this (and please make sure the previous statement ended with a semicolon):

    WITH numberedrows
         AS (SELECT ROW_NUMBER() OVER (PARTITION BY UserID 
                                           ORDER BY CreationDate)
                    - DATEDIFF(day,'19000101',CreationDate) AS TheOffset,
                    CreationDate,
                    UserID
             FROM   tablename)
    SELECT MIN(CreationDate),
           MAX(CreationDate),
           COUNT(*) AS NumConsecutiveDays,
           UserID
    FROM   numberedrows
    GROUP  BY UserID,
              TheOffset  
    

    The idea is that if we have a list of the days (as numbers) and a row_number for each, then every missed day shifts the offset between these two lists by one. So we're looking for ranges of rows that share the same offset.
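
The offset idea can be illustrated outside SQL as well. The following is a minimal Python sketch for a single user's dates, assuming one entry per day; the function name `consecutive_runs` and the sample dates are hypothetical, and the offset is computed as day number minus row number (the opposite sign from the query, which does not affect the grouping).

```python
from datetime import date
from itertools import groupby


def consecutive_runs(days):
    """Group access days into runs of consecutive dates.

    Returns a list of (first_day, last_day, num_days) tuples.
    """
    ordered = sorted(set(days))
    # Pair each day with its row number. The day number minus the row
    # number stays constant within a run and changes after every gap.
    numbered = enumerate(ordered, start=1)
    runs = []
    for _, group in groupby(numbered, key=lambda p: p[1].toordinal() - p[0]):
        run = [day for _, day in group]
        runs.append((run[0], run[-1], len(run)))
    return runs


# Hypothetical sample data with gaps on 2009-07-04, 07-07 and 07-08.
days = [date(2009, 7, d) for d in (1, 2, 3, 5, 6, 9)]
runs = consecutive_runs(days)
long_runs = [r for r in runs if r[2] >= 3]  # like HAVING COUNT(*) >= 3
print(runs)
print(long_runs)
```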

    You could use "ORDER BY NumConsecutiveDays DESC" at the end of this, or say "HAVING count(*) > 14" for a threshold...

    I haven't tested this though - just writing it off the top of my head. Hopefully works in SQL2005 and on.

    ...and would be very much helped by an index on tablename(UserID, CreationDate)

    Edited: Turns out Offset is a reserved word, so I used TheOffset instead.

    Edited: The suggestion to use COUNT(*) is very valid - I should've done that in the first place but wasn't really thinking. Previously it was using datediff(day, min(CreationDate), max(CreationDate)) instead.

    Rob

  • 2020-12-04 05:30

    Assuming a schema like this:

    create table dba.visits
    (
        id  integer not null,
        user_id integer not null,
        creation_date date not null
    );
    

    This will extract contiguous ranges from a date sequence with gaps:

    select l.creation_date  as start_d, -- Get first date in contiguous range
        (
            select min(a.creation_date ) as creation_date 
            from "DBA"."visits" a 
                left outer join "DBA"."visits" b on 
                       a.creation_date = dateadd(day, -1, b.creation_date ) and 
                       a.user_id  = b.user_id 
                where b.creation_date  is null and
                      a.creation_date  >= l.creation_date  and
                      a.user_id  = l.user_id 
        ) as end_d -- Get last date in contiguous range
    from  "DBA"."visits" l
        left outer join "DBA"."visits" r on 
            r.creation_date  = dateadd(day, -1, l.creation_date ) and 
            r.user_id  = l.user_id 
        where r.creation_date  is null
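
The two anti-joins in the query above can be read as two rules: a range starts on a day whose previous day is missing, and it ends on the first later day whose next day is missing. Here is a minimal Python sketch of those rules for a single user's dates; the function name `contiguous_ranges` and the sample dates are hypothetical.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def contiguous_ranges(days):
    """Return (start_d, end_d) for each unbroken run of dates."""
    present = set(days)
    ranges = []
    for start in sorted(present):
        # A range starts on a day whose previous day is absent
        # (the outer LEFT JOIN ... WHERE r.creation_date IS NULL).
        if start - ONE_DAY in present:
            continue
        # It ends on the first day at or after the start whose next day
        # is absent (the correlated subquery that takes the MIN).
        end = start
        while end + ONE_DAY in present:
            end += ONE_DAY
        ranges.append((start, end))
    return ranges


# Hypothetical sample data with gaps on 2009-07-04, 07-07 and 07-08.
days = [date(2009, 7, d) for d in (1, 2, 3, 5, 6, 9)]
ranges = contiguous_ranges(days)
print(ranges)
```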
    
  • 2020-12-04 05:32

    Joe Celko has a complete chapter on this in SQL for Smarties (calling it Runs and Sequences). I don't have that book at home, so when I get to work... I'll actually answer this. (assuming history table is called dbo.UserHistory and the number of days is @Days)

    Another lead is from SQL Team's blog on runs

    The other idea I've had, though I don't have a SQL Server handy to try it on, is to use a CTE with a partitioned ROW_NUMBER like this:

    WITH Runs
    AS
      (SELECT UserID
             , CreationDate
             , ROW_NUMBER() OVER(PARTITION BY UserId
                                 ORDER BY CreationDate)
               - ROW_NUMBER() OVER(PARTITION BY UserId, NoBreak
                                   ORDER BY CreationDate) AS RunNumber
      FROM
         (SELECT UH.UserID
               , UH.CreationDate
               , ISNULL((SELECT TOP 1 1 
                  FROM dbo.UserHistory AS Prior 
                  WHERE Prior.UserId = UH.UserId 
                  AND Prior.CreationDate
                      BETWEEN DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), -1)
                      AND DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), 0)), 0) AS NoBreak
          FROM dbo.UserHistory AS UH) AS Consecutive
    )
    SELECT UserID, MIN(CreationDate) AS RunStart, MAX(CreationDate) AS RunEnd
    FROM Runs
    GROUP BY UserID, RunNumber
    HAVING DATEDIFF(dd, MIN(CreationDate), MAX(CreationDate)) >= @Days
    

    The above is likely WAY HARDER than it has to be, but I've left it as a brain tickle for when you have some other definition of "a run" than just dates.

  • 2020-12-04 05:32

    I used a simple mathematical property to identify users who accessed the site on consecutive days: if there are no gaps, the number of days from the first access to the last, counted inclusively, equals the number of records in the access log (assuming one record per user per day).
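
As a quick illustration of that property, here is a minimal Python sketch that checks one user's access days. The function name `accessed_every_day` and the sample dates are hypothetical; the two users mirror a pattern of three days in a row versus two days followed by a gap.

```python
from datetime import date


def accessed_every_day(days):
    """True if the distinct access days form one unbroken run."""
    distinct = set(days)
    if not distinct:
        return False
    # Inclusive span from first to last access, in whole days.
    span = (max(distinct) - min(distinct)).days + 1
    return span == len(distinct)


# Hypothetical sample data: user 12 has no gaps, user 16 skips three days.
user_12 = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)]
user_16 = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 6)]
print(accessed_every_day(user_12))
print(accessed_every_day(user_16))
```

Note that this tests a user's whole history, so a user with one long streak and a single stray visit elsewhere would not qualify.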

    Here is the SQL script, which I tested in an Oracle DB (it should be adaptable to other DBs as well):

    -- show the math property: with no gaps, the inclusive day span equals the row count
      select   user_id,
               trunc(max (creation_date)) - trunc(min (creation_date)) + 1
                  span_days,
               count ( * ) real_day_count
        from   user_access_log
    group by   user_id;
    
    
    -- select all users that have consecutively accessed the site 
    -- (trunc drops the time of day, so the comparison does not depend on insert times)
      select   user_id
        from   user_access_log
    group by   user_id
      having   trunc(max (creation_date)) - trunc(min (creation_date)) + 1
               = count ( * );
    
    
    
    -- get the count of all users that have consecutively accessed the site 
    -- (the grouped query is wrapped so that users, not rows per user, are counted)
      select   count ( * ) user_count
        from   (  select   user_id
                    from   user_access_log
                group by   user_id
                  having   trunc(max (creation_date)) - trunc(min (creation_date)) + 1
                           = count ( * ));
    

    Table prep script:

    -- create table 
    create table user_access_log (id           number, user_id      number, creation_date date);
    
    
    -- insert seed data 
    insert into user_access_log (id, user_id, creation_date)
      values   (1, 12, sysdate);
    
    insert into user_access_log (id, user_id, creation_date)
      values   (2, 12, sysdate + 1);
    
    insert into user_access_log (id, user_id, creation_date)
      values   (3, 12, sysdate + 2);
    
    insert into user_access_log (id, user_id, creation_date)
      values   (4, 16, sysdate);
    
    insert into user_access_log (id, user_id, creation_date)
      values   (5, 16, sysdate + 1);
    
    insert into user_access_log (id, user_id, creation_date)
      values   (6, 16, sysdate + 5);
    