Querying DAU/MAU over time (daily)

Question

I have a daily sessions table with columns user_id and date. I'd like to graph out DAU/MAU (daily active users / monthly active users) on a daily basis. For example:

Date         MAU      DAU     DAU/MAU
2014-06-01   20,000   5,000   25%
2014-06-02   21,000   4,000   19%
2014-06-03   20,050   3,050   15%
...          ...      ...     ...

Calculating daily actives is straightforward, but calculating monthly actives (the number of users who logged in during the 30 days up to each date) is causing problems. How can this be achieved without a left join for each day?

Edit: I'm using Postgres.

Answer 1:

Assuming you have rows for every day, you can get the total counts using a CTE and a window frame (rows between):

with dau as (
      select date, count(user_id) as dau
      from dailysessions ds
      group by date
     )
select date, dau,
       -- current day plus the 29 days before it = a 30-row window
       sum(dau) over (order by date rows between 29 preceding and current row) as mau
from dau;

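Unfortunately, I think you want distinct users rather than just user counts; the query above sums daily counts, so a user who was active on several days in the window is counted several times. That makes the problem much more difficult, especially because Postgres doesn't support count(distinct) as a window function.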

I think you have to do some sort of self join for this. Here is one method:

with dau as (
      select date, count(distinct user_id) as dau
      from dailysessions ds
      group by date
     )
select d.date, d.dau,
       -- qualify the outer date; an unqualified "date" here would resolve to ds.date
       (select count(distinct ds.user_id)
        from dailysessions ds
        where ds.date between d.date - 29 * interval '1 day' and d.date
       ) as mau
from dau d;

Answer 2:

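This one uses COUNT DISTINCT to get the rolling 30-day DAU/MAU:

(It calculates reddit's user engagement in BigQuery. The query is written in BigQuery legacy SQL, so functions such as EXACT_COUNT_DISTINCT, FIRST and SEC_TO_TIMESTAMP need translating, but the overall structure can be reused on other databases.)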

SELECT day, dau, mau, INTEGER(100*dau/mau) daumau
FROM (
  SELECT day, EXACT_COUNT_DISTINCT(author) dau, FIRST(mau) mau
  FROM (
    SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) day, author
    FROM [fh-bigquery:reddit_comments.2015_09]
    WHERE subreddit='AskReddit') a
  JOIN (
    SELECT stopday, EXACT_COUNT_DISTINCT(author) mau
    FROM (SELECT created_utc, subreddit, author FROM [fh-bigquery:reddit_comments.2015_09], [fh-bigquery:reddit_comments.2015_08]) a
    CROSS JOIN (
      SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) stopday
      FROM [fh-bigquery:reddit_comments.2015_09]
      GROUP BY 1
    ) b
    WHERE subreddit='AskReddit'
    AND SEC_TO_TIMESTAMP(created_utc) BETWEEN DATE_ADD(stopday, -30, 'day') AND TIMESTAMP(stopday)
    GROUP BY 1
  ) b
  ON a.day=b.stopday
  GROUP BY 1
)
ORDER BY 1

I went further at How to calculate DAU/MAU with BigQuery (engagement)
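
For Postgres, the same join-then-count-distinct idea might look like the sketch below. It is not a translation of the BigQuery query line by line. It assumes the question's table is called dailysessions with columns user_id and date (of type date), and it uses a 30-day window that includes the current day; change the 29 if a different window is intended.

```sql
-- Sketch only: table and column names are assumptions taken from the question.
select day, dau, mau,
       round(100.0 * dau / mau, 1) as dau_mau_pct
from (
  select d.day,
         -- users seen on the day itself
         count(distinct case when s.date = d.day then s.user_id end) as dau,
         -- users seen anywhere in the 30-day window ending on that day
         count(distinct s.user_id) as mau
  from (select distinct date as day from dailysessions) d
  join dailysessions s
    on s.date between d.day - 29 and d.day
  group by d.day
) t
order by day;
```

Because every day in d comes from dailysessions itself, the window always contains at least one row, so mau is never zero here. The join multiplies each session row by up to 30 days, which can be slow on large tables.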

Answer 3:

I've written about this on my blog.

The DAU is easy, as you noticed. You can solve the MAU by first creating a view with boolean values for when a user activates and de-activates, like so:

CREATE OR REPLACE VIEW "vw_login" AS
SELECT *
     -- active until the next login or 30 days after this one, whichever comes first
     , LEAST(LEAD("date") OVER w, "date" + 30) AS "activeExpiry"
     -- first login ever for this user
     , CASE WHEN LAG("date") OVER w IS NULL THEN true ELSE false END AS "activated"
     -- no later login, or the next login is more than 30 days away
     , CASE
         WHEN LEAD("date") OVER w IS NULL THEN true
         WHEN LEAD("date") OVER w - "date" > 30 THEN true
         ELSE false
       END AS "churned"
     -- login after a gap of more than 30 days
     , CASE
         WHEN LAG("date") OVER w IS NULL THEN false
         WHEN "date" - LAG("date") OVER w <= 30 THEN false
         WHEN row_number() OVER w > 1 THEN true
         ELSE false
       END AS "resurrected"
  FROM "login"
WINDOW w AS (PARTITION BY "user_id" ORDER BY "date");

This creates boolean values per user per day when they become active, when they churn and when they re-activate.

Then do a daily aggregate of the same:

CREATE OR REPLACE VIEW "vw_activity" AS
SELECT 
    SUM("activated"::int) "activated"
  , SUM("churned"::int) "churned"
  , SUM("resurrected"::int) "resurrected"
  , "date"
  FROM "vw_login"
  GROUP BY "date"
  ;

And finally calculate running totals of active MAUs by calculating the cumulative sums over the columns. You need to join the vw_activity twice, since the second one is joined to the day when the user becomes inactive (i.e. 30 days since their last login).

I've included a date series in order to ensure that all days are present in your dataset. You can do without it too, but you might skip days in your dataset.
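
As a small illustration of such a date series, the sketch below generates one row per calendar day. The date bounds are arbitrary examples, and the alias g(day) is a name chosen here, not something from the original views.

```sql
-- Sketch: one row per calendar day between two example dates.
-- With date arguments and an interval step, generate_series returns
-- timestamps, so the value is cast back to date.
select g.day::date as "date"
from generate_series(date '2014-06-01',
                     date '2014-06-05',
                     interval '1 day') as g(day)
order by 1;
```

Giving the function a column alias, as in g(day), makes the generated column addressable by name when it is left-joined to the activity data.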

SELECT
   d."date"::date AS "date"
   -- running total: activations and resurrections add users, churn 30 days earlier removes them
 , SUM(COALESCE(a."activated"::int, 0)
     - COALESCE(a2."churned"::int, 0)
     + COALESCE(a."resurrected"::int, 0)) OVER w AS "mau"
 , a."activated", a2."churned", a."resurrected"
FROM generate_series('2010-01-01'::date, CURRENT_DATE, '1 day'::interval) AS d("date")
LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '30 days')::date
WINDOW w AS (ORDER BY d."date")
ORDER BY d."date";

You can of course do this in a single query, but splitting it into views makes the structure easier to follow.

Answer 4:

You didn't show us your complete table definition, but maybe something like this:

select date,
       count(*) over (partition by date_trunc('day', date) order by date) as dau,
       count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
order by date;

To get the percentage without repeating the window functions, just wrap this in a derived table:

select date, 
       dau,
       mau,
       dau::numeric / (case when mau = 0 then null else mau end) as pct
from (
    select date,
           count(*) over (partition by date_trunc('day', date) order by date) as dau,
           count(*) over (partition by date_trunc('month', date) order by date) as mau
    from sessions
) t
order by date;

Here is an example output:

postgres=> select * from sessions;
 session_date | user_id
--------------+---------
 2014-05-01   |       1
 2014-05-01   |       2
 2014-05-01   |       3
 2014-05-02   |       1
 2014-05-02   |       2
 2014-05-02   |       3
 2014-05-02   |       4
 2014-05-02   |       5
 2014-06-01   |       1
 2014-06-01   |       2
 2014-06-01   |       3
 2014-06-02   |       1
 2014-06-02   |       2
 2014-06-02   |       3
 2014-06-02   |       4
 2014-06-03   |       1
 2014-06-03   |       2
 2014-06-03   |       3
 2014-06-03   |       4
 2014-06-03   |       5
(20 rows)

postgres=> select session_date,
postgres->        dau,
postgres->        mau,
postgres->        round(dau::numeric / (case when mau = 0 then null else mau end),2) as pct
postgres-> from (
postgres(>     select session_date,
postgres(>            count(*) over (partition by date_trunc('day', session_date) order by session_date) as dau,
postgres(>            count(*) over (partition by date_trunc('month', session_date) order by session_date) as mau
postgres(>     from sessions
postgres(> ) t
postgres-> order by session_date;
 session_date | dau | mau | pct
--------------+-----+-----+------
 2014-05-01   |   3 |   3 | 1.00
 2014-05-01   |   3 |   3 | 1.00
 2014-05-01   |   3 |   3 | 1.00
 2014-05-02   |   5 |   8 | 0.63
 2014-05-02   |   5 |   8 | 0.63
 2014-05-02   |   5 |   8 | 0.63
 2014-05-02   |   5 |   8 | 0.63
 2014-05-02   |   5 |   8 | 0.63
 2014-06-01   |   3 |   3 | 1.00
 2014-06-01   |   3 |   3 | 1.00
 2014-06-01   |   3 |   3 | 1.00
 2014-06-02   |   4 |   7 | 0.57
 2014-06-02   |   4 |   7 | 0.57
 2014-06-02   |   4 |   7 | 0.57
 2014-06-02   |   4 |   7 | 0.57
 2014-06-03   |   5 |  12 | 0.42
 2014-06-03   |   5 |  12 | 0.42
 2014-06-03   |   5 |  12 | 0.42
 2014-06-03   |   5 |  12 | 0.42
 2014-06-03   |   5 |  12 | 0.42
(20 rows)

postgres=>
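
Note that the queries above count session rows, accumulated within each calendar month, rather than distinct users: on 2014-05-02 the output shows mau = 8 although only five different users appear in May. If distinct users per calendar month-to-date are wanted instead, one possible sketch is shown below. It assumes the sessions table from the example, with session_date of type date.

```sql
-- Sketch only: distinct users per day and per calendar month-to-date.
select s.session_date,
       count(distinct s.user_id) as dau,
       (select count(distinct m.user_id)
        from sessions m
        where m.session_date >= date_trunc('month', s.session_date)::date
          and m.session_date <= s.session_date) as mau
from sessions s
group by s.session_date
order by s.session_date;
```

This returns one row per day instead of one row per session. It still resets at the start of each calendar month, so it is not a rolling 30-day MAU.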

Source: https://stackoverflow.com/questions/24494373/querying-dau-mau-over-time-daily

Tags

sql

postgresql