Redshift - Calculate monthly active users

女生的网名这么多〃 submitted on 2020-01-03 01:41:43

Question


I have a table which looks like this:

Date       | User_ID
2017-1-1   |  1
2017-1-1   |  2
2017-1-1   |  4
2017-1-2   |  3
2017-1-2   |  2
...        |  ..
...        |  ..
...        |  ..
...        |  ..
2017-2-1   |  1
2017-2-2   |  2
...        |  ..
...        |  ..
...        |  ..

I'd like to calculate monthly active users (MAU) over a rolling 30-day window. I know Redshift does not support COUNT(DISTINCT ...) as a window function. What can I do to get the following output?

Date      | MAU
2017-1-1  | 3
2017-1-2  | 4    <- We don't want to count user_id 2 twice.
...       | ..
...       | ..
...       | ..
2017-2-1  | ..
2017-2-2  | ..
...       | ..
...       | ..

I attempted to do this (and clearly failed). Here's my code:

SELECT event_date
    ,sum(user_count) mau_count
    ,CASE
        WHEN event_date = date_trunc('week', event_date)
            THEN 1
        ELSE 0
        END week_starting FROM (
    SELECT event_date
        ,count(*) OVER (PARTITION BY event_date ORDER BY event_date ROWS BETWEEN 30 PRECEDING
                    AND CURRENT ROW
            ) AS user_count    <-- I know this is wrong. Just my attempt :)
    FROM (
        SELECT DISTINCT (user_id)
            ,event_date
        FROM event_table
        ) daily_distinct_users
    GROUP BY event_date
    ) cumulative_daily_distinct_users GROUP BY event_date;

Please let me know how I can get the MAU count accurately. Thanks!


Answer 1:


This one seems to work (column names in the log table are dt and userid):

SELECT
  end_date,
  -- The number of distinct users during the 30 days prior
  COUNT(DISTINCT userid) distinct_users
FROM log
JOIN
( -- A list of dates to appear in the output first column
  SELECT DISTINCT dt AS end_date
  FROM log
  WHERE dt BETWEEN date '2017-01-01' AND date '2017-01-31'
) ON dt BETWEEN end_date - interval '30 days' AND end_date
GROUP BY end_date
ORDER BY end_date

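The sub-select generates the list of end_date values that appear in the first output column. Each end_date is then joined to every log row dated within the 30 days before it, and COUNT(DISTINCT userid) counts each user once per end_date. Because BETWEEN is inclusive at both ends, the window as written spans 31 calendar days; use interval '29 days' if you want exactly 30.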
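To see the self-join idea on a handful of rows, here is a minimal sketch that runs an equivalent query in SQLite through Python's sqlite3 module. SQLite is only a stand-in for Redshift here, the sample rows are made up to match the question, and the window is assumed to be exactly 30 calendar days (29 days back plus the end date).

```python
import sqlite3

# Hypothetical sample rows mirroring the question; table and column names
# (log, dt, userid) follow the answer above.
rows = [
    ("2017-01-01", 1),
    ("2017-01-01", 2),
    ("2017-01-01", 4),
    ("2017-01-02", 3),
    ("2017-01-02", 2),
    ("2017-02-01", 1),
    ("2017-02-02", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (dt TEXT, userid INTEGER)")
conn.executemany("INSERT INTO log VALUES (?, ?)", rows)

# SQLite writes the date arithmetic as date(x, '-29 days');
# in Redshift this would be end_date - interval '29 days'.
query = """
SELECT d.end_date, COUNT(DISTINCT l.userid) AS mau
FROM (SELECT DISTINCT dt AS end_date FROM log) AS d
JOIN log AS l
  ON l.dt BETWEEN date(d.end_date, '-29 days') AND d.end_date
GROUP BY d.end_date
ORDER BY d.end_date
"""
mau = conn.execute(query).fetchall()
for end_date, count in mau:
    print(end_date, count)
```

On this sample, user 2 is counted once for 2017-01-02 even though they appear on both January days, which is the behaviour the question asks for. Note that the join compares every end date against every log row, so it can get expensive on large tables.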




Answer 2:


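Assuming there are no missing dates, you can first find the date each user first appeared on using the MIN function, then count the users per first-appearance date, and finally use SUM as a window function to get the rolling sum. Note that ROWS BETWEEN 30 PRECEDING AND CURRENT ROW counts rows rather than days, which is why gaps in the dates would skew the result.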

SELECT DISTINCT EVENT_DATE,
SUM(CNT) OVER(ORDER BY EVENT_DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS MAU
FROM
 (SELECT E.EVENT_DATE,
         COUNT(DISTINCT T.USER_ID) AS CNT
  FROM EVENT_TABLE E
  LEFT JOIN
   (SELECT DISTINCT USER_ID,
     MIN(EVENT_DATE) OVER(PARTITION BY USER_ID
                          ORDER BY EVENT_DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS FIRST_APPEARED_ON
    FROM EVENT_TABLE 
   ) T ON T.FIRST_APPEARED_ON=E.EVENT_DATE AND T.USER_ID=E.USER_ID
  GROUP BY E.EVENT_DATE
) T1

Sample Demo using SQL Server
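Because this approach relies on there being no missing dates, it can help to cross-check any query's output against a brute-force count on a small extract. The following is a minimal sketch in plain Python, not part of the original answer; the function name rolling_mau, the sample events, and the choice of a 30-calendar-day window are all assumptions made for illustration.

```python
from datetime import date, timedelta


def rolling_mau(events, window_days=30):
    """Map each event date to the number of distinct users seen in the
    window_days calendar days ending on that date (inclusive)."""
    users_by_day = {}
    for day, user in events:
        users_by_day.setdefault(day, set()).add(user)

    result = {}
    for end in sorted(users_by_day):
        start = end - timedelta(days=window_days - 1)
        active = set()
        # Union the user sets of every day inside [start, end].
        for day, users in users_by_day.items():
            if start <= day <= end:
                active |= users
        result[end] = len(active)
    return result


# Hypothetical sample data mirroring the question.
events = [
    (date(2017, 1, 1), 1),
    (date(2017, 1, 1), 2),
    (date(2017, 1, 1), 4),
    (date(2017, 1, 2), 3),
    (date(2017, 1, 2), 2),
    (date(2017, 2, 1), 1),
    (date(2017, 2, 2), 2),
]
reference = rolling_mau(events)
for day, count in reference.items():
    print(day, count)
```

This works on calendar dates rather than row offsets, so gaps in the data do not shift the window. It is quadratic in the number of days and only meant for checking small samples.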




Answer 3:


@John Rotenstein's answer works well.

For those who stumble across this question and are looking for something a little more, the following blog post describes an alternative precomputation strategy for computing rolling MAUs quickly. It's overkill for the question here but might come in handy in case you:

  • are exasperated with the slow speed of growth metric calculations for interactive queries,
  • need to compute other rolling growth metrics (e.g., registrations, activation, retention, reactivation), or
  • regularly perform analyses that involve some type of rolling user count.


Source: https://stackoverflow.com/questions/42261489/redshift-calculate-monthly-active-users
