The following User History table contains one record for every day a given user has accessed a website (in a 24 hour UTC period). It has many thousands of r
This should do what you want but I don't have enough data to test efficiency. The convoluted CONVERT/FLOOR stuff is to strip the time portion off the datetime field. If you're using SQL Server 2008 then you could use CAST(x.CreationDate AS DATE).
DECLARE @Range as INT SET @Range = 10 SELECT DISTINCT UserId, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate))) FROM tblUserLogin a WHERE EXISTS (SELECT 1 FROM tblUserLogin b WHERE a.userId = b.userId AND (SELECT COUNT(DISTINCT(CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, CreationDate))))) FROM tblUserLogin c WHERE c.userid = b.userid AND CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, c.CreationDate))) BETWEEN CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate))) and CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))+@Range-1) = @Range)
Creation script
CREATE TABLE [dbo].[tblUserLogin]( [Id] [int] IDENTITY(1,1) NOT NULL, [UserId] [int] NULL, [CreationDate] [datetime] NULL ) ON [PRIMARY]
The answer is obviously:
SELECT DISTINCT UserId
FROM UserHistory uh1
WHERE (
SELECT COUNT(*)
FROM UserHistory uh2
WHERE uh2.CreationDate
BETWEEN uh1.CreationDate AND DATEADD(d, @days, uh1.CreationDate)
) = @days OR UserId = 52551
EDIT:
Okay here's my serious answer:
DECLARE @days int
DECLARE @seconds bigint
SET @days = 30
SET @seconds = (@days * 24 * 60 * 60) - 1
SELECT DISTINCT UserId
FROM (
SELECT uh1.UserId, Count(uh1.Id) as Conseq
FROM UserHistory uh1
INNER JOIN UserHistory uh2 ON uh2.CreationDate
BETWEEN uh1.CreationDate AND
DATEADD(s, @seconds, DATEADD(dd, DATEDIFF(dd, 0, uh1.CreationDate), 0))
AND uh1.UserId = uh2.UserId
GROUP BY uh1.Id, uh1.UserId
) as Tbl
WHERE Conseq >= @days
EDIT:
[Jeff Atwood] This is a great fast solution and deserves to be accepted, but Rob Farley's solution is also excellent and arguably even faster (!). Please check it out too!
How about (and please make sure the previous statement ended with a semi-colon):
WITH numberedrows
AS (SELECT ROW_NUMBER() OVER (PARTITION BY UserID
ORDER BY CreationDate)
- DATEDIFF(day,'19000101',CreationDate) AS TheOffset,
CreationDate,
UserID
FROM tablename)
SELECT MIN(CreationDate),
MAX(CreationDate),
COUNT(*) AS NumConsecutiveDays,
UserID
FROM numberedrows
GROUP BY UserID,
TheOffset
The idea being that if we have list of the days (as a number), and a row_number, then missed days make the offset between these two lists slightly bigger. So we're looking for a range that has a consistent offset.
You could use "ORDER BY NumConsecutiveDays DESC" at the end of this, or say "HAVING count(*) > 14" for a threshold...
I haven't tested this though - just writing it off the top of my head. Hopefully works in SQL2005 and on.
...and would be very much helped by an index on tablename(UserID, CreationDate)
Edited: Turns out Offset is a reserved word, so I used TheOffset instead.
Edited: The suggestion to use COUNT(*) is very valid - I should've done that in the first place but wasn't really thinking. Previously it was using datediff(day, min(CreationDate), max(CreationDate)) instead.
Rob
assuming a schema that goes like:
create table dba.visits
(
id integer not null,
user_id integer not null,
creation_date date not null
);
this will extract contiguous ranges from a date sequence with gaps.
select l.creation_date as start_d, -- Get first date in contiguous range
(
select min(a.creation_date ) as creation_date
from "DBA"."visits" a
left outer join "DBA"."visits" b on
a.creation_date = dateadd(day, -1, b.creation_date ) and
a.user_id = b.user_id
where b.creation_date is null and
a.creation_date >= l.creation_date and
a.user_id = l.user_id
) as end_d -- Get last date in contiguous range
from "DBA"."visits" l
left outer join "DBA"."visits" r on
r.creation_date = dateadd(day, -1, l.creation_date ) and
r.user_id = l.user_id
where r.creation_date is null
Joe Celko has a complete chapter on this in SQL for Smarties (calling it Runs and Sequences). I don't have that book at home, so when I get to work... I'll actually answer this. (assuming history table is called dbo.UserHistory and the number of days is @Days)
Another lead is from SQL Team's blog on runs
The other idea I've had, but don't have a SQL server handy to work on here is to use a CTE with a partitioned ROW_NUMBER like this:
WITH Runs
AS
(SELECT UserID
, CreationDate
, ROW_NUMBER() OVER(PARTITION BY UserId
ORDER BY CreationDate)
- ROW_NUMBER() OVER(PARTITION BY UserId, NoBreak
ORDER BY CreationDate) AS RunNumber
FROM
(SELECT UH.UserID
, UH.CreationDate
, ISNULL((SELECT TOP 1 1
FROM dbo.UserHistory AS Prior
WHERE Prior.UserId = UH.UserId
AND Prior.CreationDate
BETWEEN DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), -1)
AND DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), 0)), 0) AS NoBreak
FROM dbo.UserHistory AS UH) AS Consecutive
)
SELECT UserID, MIN(CreationDate) AS RunStart, MAX(CreationDate) AS RunEnd
FROM Runs
GROUP BY UserID, RunNumber
HAVING DATEDIFF(dd, MIN(CreationDate), MAX(CreationDate)) >= @Days
The above is likely WAY HARDER than it has to be, but left as an a brain tickle for when you have some other definition of "a run" than just dates.
I used a simple math property to identify who consecutively accessed the site. This property is that you should have the day difference between the first time access and last time equal to number of records in your access table log.
Here are SQL script that I tested in Oracle DB (it should work in other DBs as well):
-- show basic understand of the math properties
select ceil(max (creation_date) - min (creation_date))
max_min_days_diff,
count ( * ) real_day_count
from user_access_log
group by user_id;
-- select all users that have consecutively accessed the site
select user_id
from user_access_log
group by user_id
having ceil(max (creation_date) - min (creation_date))
/ count ( * ) = 1;
-- get the count of all users that have consecutively accessed the site
select count(user_id) user_count
from user_access_log
group by user_id
having ceil(max (creation_date) - min (creation_date))
/ count ( * ) = 1;
Table prep script:
-- create table
create table user_access_log (id number, user_id number, creation_date date);
-- insert seed data
insert into user_access_log (id, user_id, creation_date)
values (1, 12, sysdate);
insert into user_access_log (id, user_id, creation_date)
values (2, 12, sysdate + 1);
insert into user_access_log (id, user_id, creation_date)
values (3, 12, sysdate + 2);
insert into user_access_log (id, user_id, creation_date)
values (4, 16, sysdate);
insert into user_access_log (id, user_id, creation_date)
values (5, 16, sysdate + 1);
insert into user_access_log (id, user_id, creation_date)
values (6, 16, sysdate + 5);