Question
I have a table which looks like this:
Date | User_ID
2017-1-1 | 1
2017-1-1 | 2
2017-1-1 | 4
2017-1-2 | 3
2017-1-2 | 2
... | ..
... | ..
... | ..
... | ..
2017-2-1 | 1
2017-2-2 | 2
... | ..
... | ..
... | ..
I'd like to calculate the monthly active users (MAU) over a rolling 30-day period. I know Redshift does not support COUNT(DISTINCT) as a window function. What can I do to get the following output?
Date | MAU
2017-1-1 | 3
2017-1-2 | 4 <- We don't want to count user_id 2 twice.
... | ..
... | ..
... | ..
2017-2-1 | ..
2017-2-2 | ..
... | ..
... | ..
I attempted to do this (and clearly failed). Here's my code:
SELECT event_date
    ,sum(user_count) mau_count
    ,CASE
        WHEN event_date = date_trunc('week', event_date)
            THEN 1
        ELSE 0
        END week_starting
FROM (
    SELECT event_date
        ,count(*) OVER (
            PARTITION BY event_date ORDER BY event_date ROWS BETWEEN 30 PRECEDING
                AND CURRENT ROW
            ) AS user_count -- I know this is wrong. Just my attempt :)
    FROM (
        SELECT DISTINCT (user_id)
            ,event_date
        FROM event_table
        ) daily_distinct_users
    GROUP BY event_date
    ) cumulative_daily_distinct_users
GROUP BY event_date;
Please let me know how I can get the MAU count accurately. Thanks!
Answer 1:
This one seems to work (column names in the log table are dt and userid):
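SELECT
    end_date,
    -- The number of distinct users in the window ending on end_date
    COUNT(DISTINCT userid) AS distinct_users
FROM log
JOIN
    ( -- A list of dates to appear in the first output column
    SELECT DISTINCT dt AS end_date
    FROM log
    WHERE dt BETWEEN date '2017-01-01' AND date '2017-01-31'
    ) AS end_dates -- alias added; some engines require one for a derived table
    -- BETWEEN is inclusive on both ends, so this spans 31 calendar dates;
    -- use interval '29 days' for a window of exactly 30
    ON dt BETWEEN end_date - interval '30 days' AND end_date
GROUP BY end_date
ORDER BY end_date;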
Basically, the sub-select generates the list of end_date values that appear in the first output column. The join then pairs each end_date with every log row whose dt falls in the 30 days up to and including that date, and COUNT(DISTINCT userid) counts each user once per window.
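To see the join at work on the sample rows from the question, here is a minimal sketch that runs the same query shape in an in-memory SQLite database from Python. This is an illustration, not Redshift: the sample rows are retyped with ISO dates, the alias d is introduced for the example, the WHERE filter on the date list is dropped, and date(end_date, '-30 days') is SQLite's equivalent of end_date - interval '30 days'.

```python
import sqlite3

# Sample rows mirroring the question's table (ISO dates so SQLite can
# compare them as text). These values are assumptions for illustration.
rows = [
    ("2017-01-01", 1),
    ("2017-01-01", 2),
    ("2017-01-01", 4),
    ("2017-01-02", 3),
    ("2017-01-02", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (dt TEXT, userid INTEGER)")
conn.executemany("INSERT INTO log VALUES (?, ?)", rows)

# Same shape as the query above: a list of end dates joined back to the
# log rows that fall inside each date's window.
query = """
SELECT d.end_date, COUNT(DISTINCT log.userid) AS distinct_users
FROM log
JOIN (SELECT DISTINCT dt AS end_date FROM log) AS d
  ON log.dt BETWEEN date(d.end_date, '-30 days') AND d.end_date
GROUP BY d.end_date
ORDER BY d.end_date
"""
result = conn.execute(query).fetchall()
print(result)
```

User 2 is active on both days but is counted once in the window ending 2017-01-02, which matches the output the question asks for.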
Answer 2:
Assuming there are no missing dates, you can first find the date each user first appeared on with the MIN function, then count the users for each date, and finally use SUM as a window function to get the rolling sum.
SELECT DISTINCT EVENT_DATE,
       SUM(CNT) OVER(ORDER BY EVENT_DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS MAU
FROM
  (SELECT E.EVENT_DATE,
          COUNT(DISTINCT T.USER_ID) AS CNT
   FROM EVENT_TABLE E
   LEFT JOIN
     (SELECT DISTINCT USER_ID,
             MIN(EVENT_DATE) OVER(PARTITION BY USER_ID
                                  ORDER BY EVENT_DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS FIRST_APPEARED_ON
      FROM EVENT_TABLE
     ) T ON T.FIRST_APPEARED_ON=E.EVENT_DATE AND T.USER_ID=E.USER_ID
   GROUP BY E.EVENT_DATE
  ) T1
Sample Demo using SQL Server
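A rolling SUM of per-day counts equals a distinct count only if each user contributes once within the window, so it may be worth checking any such query against a brute-force reference on a small sample. The sketch below is one way to do that in Python. The function name rolling_distinct_users, the window_days parameter, and the sample rows are all hypothetical, introduced only for this illustration.

```python
from datetime import date, timedelta

def rolling_distinct_users(events, window_days=30):
    """Brute force: for each date, union the user sets of the last window_days dates."""
    users_by_day = {}
    for day, user_id in events:
        users_by_day.setdefault(day, set()).add(user_id)

    result = {}
    for end in sorted(users_by_day):
        # Inclusive window of exactly window_days calendar dates ending on `end`.
        start = end - timedelta(days=window_days - 1)
        active = set()
        for day, users in users_by_day.items():
            if start <= day <= end:
                active |= users
        result[end] = len(active)
    return result

# Sample rows mirroring the question's table (assumed for illustration).
events = [
    (date(2017, 1, 1), 1),
    (date(2017, 1, 1), 2),
    (date(2017, 1, 1), 4),
    (date(2017, 1, 2), 3),
    (date(2017, 1, 2), 2),
]

mau = rolling_distinct_users(events)

# For comparison: summing each day's distinct-user count double-counts
# anyone who is active on more than one day in the window.
daily_counts = {}
for day, user_id in events:
    daily_counts.setdefault(day, set()).add(user_id)
naive_sum = sum(len(users) for users in daily_counts.values())

print(mau[date(2017, 1, 2)], naive_sum)
```

On this sample the reference gives 4 for 2017-01-02, while the plain sum of daily counts gives 5 because user 2 is counted on both days.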
Answer 3:
@John Rotenstein's answer works well.
For those who stumble across this question and are looking for something a little more, the following blog post describes an alternative precomputation strategy for computing rolling MAUs quickly. It's overkill for the question here but might come in handy in case you:
- are exasperated with the slow speed of growth metric calculations for interactive queries,
- need to compute other rolling growth metrics (e.g., registrations, activation, retention, reactivation), or
- regularly perform analyses that involve some type of rolling user count.
Source: https://stackoverflow.com/questions/42261489/redshift-calculate-monthly-active-users