Redshift - Calculate monthly active users

Question

I have a table which looks like this:

Date       | User_ID
2017-1-1   |  1
2017-1-1   |  2
2017-1-1   |  4
2017-1-2   |  3
2017-1-2   |  2
...        |  ..
...        |  ..
...        |  ..
...        |  ..
2017-2-1   |  1
2017-2-2   |  2
...        |  ..
...        |  ..
...        |  ..

I'd like to calculate monthly active users (MAU) over a rolling 30-day window. I know Redshift does not support COUNT(DISTINCT ...) as a window function. What can I do to get the following output?

Date      | MAU
2017-1-1  | 3
2017-1-2  | 4    <- We don't want to count user_id 2 twice.
...       | ..
...       | ..
...       | ..
2017-2-1  | ..
2017-2-2  | ..
...       | ..
...       | ..
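
To pin down the target metric, here is a brute-force reference sketch in Python. It uses only the first five sample rows shown above. The 30-day window is assumed to include the current day (the day itself plus the 29 days before it), and the function name rolling_mau is hypothetical. It is meant for checking a SQL result on small data, not as a replacement for the query.

```python
from datetime import date, timedelta

# First five sample rows from the table above: (event_date, user_id).
events = [
    (date(2017, 1, 1), 1),
    (date(2017, 1, 1), 2),
    (date(2017, 1, 1), 4),
    (date(2017, 1, 2), 3),
    (date(2017, 1, 2), 2),
]

def rolling_mau(events, window_days=30):
    """Return {day: number of distinct users seen in the window ending on day}.

    Assumption: the window covers `window_days` calendar days, including `day`.
    """
    result = {}
    for day in sorted({d for d, _ in events}):
        start = day - timedelta(days=window_days - 1)
        # A set collapses repeat visits, so each user counts once per window.
        users = {u for d, u in events if start <= d <= day}
        result[day] = len(users)
    return result

mau = rolling_mau(events)
for day, count in mau.items():
    print(day, count)
# 2017-01-01 3
# 2017-01-02 4
```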

I attempted to do this (and clearly failed). Here's my code:

SELECT event_date
    ,sum(user_count) mau_count
    ,CASE
        WHEN event_date = date_trunc('week', event_date)
            THEN 1
        ELSE 0
        END week_starting FROM (
    SELECT event_date
        ,count(*) OVER (PARTITION BY event_date ORDER BY event_date ROWS BETWEEN 30 PRECEDING
                    AND CURRENT ROW
            ) AS user_count    -- I know this is wrong. Just my attempt :)
    FROM (
        SELECT DISTINCT (user_id)
            ,event_date
        FROM event_table
        ) daily_distinct_users
    GROUP BY event_date
    ) cumulative_daily_distinct_users GROUP BY event_date;

Please let me know how I can get the MAU count accurately. Thanks!

Answer 1:

This one seems to work (column names in the log table are dt and userid):

SELECT
  end_date,
  -- The number of distinct users during the 30 days prior
  COUNT(DISTINCT userid) distinct_users
FROM log
JOIN
( -- A list of dates to appear in the output first column
  SELECT DISTINCT dt AS end_date
  FROM log
  WHERE dt BETWEEN date '2017-01-01' AND date '2017-01-31'
) ON dt BETWEEN end_date - interval '30 days' AND end_date
GROUP BY end_date
ORDER BY end_date

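The sub-select generates the list of end_dates that appear in the first output column. Each end_date is then joined to every log row dated between 30 days before it and the end_date itself, and COUNT(DISTINCT userid) counts each user once per end_date, no matter how many days that user was active in the window.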
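
A minimal way to try the self-join above on a small data set, sketched with Python's built-in sqlite3 module rather than Redshift. Several things here are assumptions made for the sketch: SQLite's date(x, '-30 days') stands in for Redshift's end_date - interval '30 days', the table aliases l and d are added, the January WHERE filter is dropped so that every date becomes an end_date, and the data is limited to the rows spelled out in the question (the elided rows are left out).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dates are stored as ISO-8601 text, which SQLite compares correctly as strings.
conn.execute("CREATE TABLE log (dt TEXT, userid INTEGER)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    [
        ("2017-01-01", 1),
        ("2017-01-01", 2),
        ("2017-01-01", 4),
        ("2017-01-02", 3),
        ("2017-01-02", 2),
        ("2017-02-01", 1),
        ("2017-02-02", 2),
    ],
)

query = """
SELECT d.end_date, COUNT(DISTINCT l.userid) AS distinct_users
FROM log AS l
JOIN (SELECT DISTINCT dt AS end_date FROM log) AS d
  ON l.dt BETWEEN date(d.end_date, '-30 days') AND d.end_date
GROUP BY d.end_date
ORDER BY d.end_date
"""
rows = conn.execute(query).fetchall()
for end_date, distinct_users in rows:
    print(end_date, distinct_users)
# 2017-01-01 3
# 2017-01-02 4
# 2017-02-01 3
# 2017-02-02 2
```

Because BETWEEN is inclusive on both ends, subtracting 30 days gives a window of 31 calendar days. If a strict 30-day window is wanted, subtracting 29 days instead is one option. The join pairs every end_date with every row in its window, so the intermediate result grows with both the number of dates and the window size.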

Answer 2:

Assuming there are no missing dates, you can first find the date each user first appeared on using the MIN function, then count the users for each date, and finally use SUM as a window function to get the rolling total.

SELECT DISTINCT EVENT_DATE,
SUM(CNT) OVER(ORDER BY EVENT_DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS MAU
FROM
 (SELECT E.EVENT_DATE,
         COUNT(DISTINCT T.USER_ID) AS CNT
  FROM EVENT_TABLE E
  LEFT JOIN
   (SELECT DISTINCT USER_ID,
     MIN(EVENT_DATE) OVER(PARTITION BY USER_ID
                          ORDER BY EVENT_DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS FIRST_APPEARED_ON
    FROM EVENT_TABLE 
   ) T ON T.FIRST_APPEARED_ON=E.EVENT_DATE AND T.USER_ID=E.USER_ID
  GROUP BY E.EVENT_DATE
) T1

Sample Demo using SQL Server

Answer 3:

@John Rotenstein's answer works well.

For those who stumble across this question and are looking for something a little more, the following blog post describes an alternative precomputation strategy for computing rolling MAUs quickly. It's overkill for the question here but might come in handy in case you:

are exasperated with the slow speed of growth metric calculations for interactive queries,
need to compute other rolling growth metrics (e.g., registrations, activation, retention, reactivation), or
regularly perform analyses that involve some type of rolling user count.

Source: https://stackoverflow.com/questions/42261489/redshift-calculate-monthly-active-users

Tags

sql

aggregate

aggregate-functions

amazon-redshift