Count cumulative total in Postgresql

℡╲_俬逩灬. 提交于 2019-11-27 10:43:57

With larger datasets, window functions are the most efficient way to perform these kinds of queries -- the table will be scanned only once, instead of once for each date, like a self-join would do. It also looks a lot simpler. :) PostgreSQL 8.4 and up have support for window functions.

This is what it looks like:

SELECT created_at, sum(count(email)) OVER (ORDER BY created_at)
FROM subscriptions
GROUP BY created_at;

Here OVER creates the window; ORDER BY created_at means that it has to sum up the counts in created_at order.


Edit: If you want to remove duplicate emails within a single day, you can use sum(count(distinct email)). Unfortunately this won't remove duplicates that cross different dates.

If you want to remove all duplicates, I think the easiest is to use a subquery and DISTINCT ON. This will attribute emails to their earliest date (because I'm sorting by created_at in ascending order, it'll choose the earliest one):

SELECT created_at, sum(count(email)) OVER (ORDER BY created_at)
FROM (
    SELECT DISTINCT ON (email) created_at, email
    FROM subscriptions ORDER BY email, created_at
) AS subq
GROUP BY created_at;

If you create an index on (email, created_at), this query shouldn't be too slow either.


(If you want to test, this is how I created the sample dataset)

create table subscriptions as
   select date '2000-04-04' + (i/10000)::int as created_at,
          'foofoobar@foobar.com' || (i%700000)::text as email
   from generate_series(1,1000000) i;
create index on subscriptions (email, created_at);

Use:

SELECT a.created_at,
       (SELECT COUNT(b.email)
          FROM SUBSCRIPTIONS b
         WHERE b.created_at <= a.created_at) AS count
  FROM SUBSCRIPTIONS a
SELECT
  s1.created_at,
  COUNT(s2.email) AS cumul_count
FROM subscriptions s1
  INNER JOIN subscriptions s2 ON s1.created_at >= s2.created_at
GROUP BY s1.created_at

I assume you want only one row per day and you want to still show days without any subscriptions (suppose nobody subscribes for a certain date, do you want to show that date with the balance of the previous day?). If this is the case, you can use the 'with' feature:

with recursive serialdates(adate) as (
    select cast('2011-04-04' as date)
    union all
    select adate + 1 from serialdates where adate < cast('2011-04-07' as date)
)
select D.adate,
(
    select count(distinct email)
    from subscriptions
    where created_at between date_trunc('month', D.adate) and D.adate
)
from serialdates D

The best way is to have a calendar table: calendar ( date date, month int, quarter int, half int, week int, year int )

Then, you can join this table to make summary for the field you need.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!