GROUP BY and COUNT in PostgreSQL

纵然是瞬间 提交于 2019-11-28 19:10:27

I think you just need COUNT(DISTINCT post_id) FROM votes.

See "4.2.7. Aggregate Expressions" section in http://www.postgresql.org/docs/current/static/sql-expressions.html.

EDIT: Corrected my careless mistake per Erwin's comment.

Erwin Brandstetter

There is also EXISTS:

SELECT count(*) AS post_ct
FROM   posts p
WHERE  EXISTS (SELECT 1 FROM votes v WHERE v.post_id = p.id);

Which, in PostgreSQL and with multiple entries on the n-side like you undoubtedly have, is generally faster than count(DISTINCT x):

SELECT count(DISTINCT p.id) AS post_ct
FROM   posts p
JOIN   votes v ON v.post_id = p.id;

The more rows per post there are in votes, the bigger the difference in performance. Just try it with EXPLAIN ANALYZE.

count(DISTINCT x) will collect all rows, sort or hash them, and then only consider the first per identical set. EXISTS will only scan votes (or, preferably, an index on post_id) until the first match is found.

If every post_id is guaranteed to be present in the table posts (for instance, by foreign key constraint), this short form is equivalent to the longer form:

SELECT count(DISTINCT post_id) AS post_ct
FROM   votes;

This may actually be faster than the first query with EXISTS, with no or few entries per post.

The query you had works in simpler form, too:

SELECT count(*) AS post_ct
FROM  (
    SELECT 1
    FROM   posts 
    JOIN   votes ON votes.post_id = posts.id 
    GROUP  BY posts.id
    ) x;

Benchmark

To verify my claims I ran a benchmark on my test server with limited resources. All in a separate schema:

Test setup

Fake a typical post / vote situation:

CREATE SCHEMA y;
SET search_path = y;

CREATE TABLE posts (
  id   int PRIMARY KEY -- I don't use "id" as column name
, post text);

INSERT INTO posts
SELECT g, repeat(chr(g%100 + 32), (random()* 500)::int) -- random text
FROM   generate_series(1,10000) g;

DELETE FROM posts WHERE random() > 0.9;  -- create ~10 % dead tuples

CREATE TABLE votes (
  vote_id serial PRIMARY KEY
, post_id int REFERENCES posts(id)
, up_down bool
);

INSERT INTO votes (post_id, up_down)
SELECT g.* 
FROM  (
   SELECT ((random()* 21)^3)::int + 1111 AS post_id -- uneven vote distribution
        , random()::int::bool AS up_down
   FROM   generate_series(1,70000)
   ) g
JOIN   posts p ON p.id = g.post_id;

All of the following queries returned the same result (8093 of 9107 posts had votes).
I ran 4 tests with EXPLAIN ANALYZE (best of five) on Postgres 9.1.4 with each of the three queries and appended the resulting total runtimes.

  1. As is.

  2. After ..

    ANALYZE posts;
    ANALYZE votes;
    
  3. After ..

    CREATE INDEX foo on votes(post_id);
    
  4. After ..

    VACUUM FULL ANALYZE posts;
    CLUSTER votes using foo;
    

count(*) ... WHERE EXISTS

  1. 253 ms
  2. 220 ms
  3. 85 ms (seq scan on posts, index scan on votes, nested loop)
  4. 85 ms

count(DISTINCT x) - long form with join

  1. 354 ms
  2. 358 ms
  3. 373 ms (index scan on posts, index scan on votes, merge join)
  4. 330 ms

count(DISTINCT x) - short form without join

  1. 164 ms
  2. 164 ms
  3. 164 ms (always seq scan)
  4. 142 ms

Best time for original query in question:

  • 353 ms

For simplified version:

  • 348 ms

@wildplasser's query with a CTE uses the same plan as the long form (index scan on posts, index scan on votes, merge join) plus a little overhead for the CTE. Best time:

  • 366 ms

Index-only scans in the upcoming PostgreSQL 9.2 can change the result for each of these queries.

DROP SCHEMA y CASCADE;  -- clean up

Related, more detailed benchmark for Postgres 9.5 (actually retrieving distinct rows, not just counting):

Using OVER() and LIMIT 1:

SELECT COUNT(1) OVER()
FROM posts 
   INNER JOIN votes ON votes.post_id = posts.id 
GROUP BY posts.id
LIMIT 1;
WITH uniq AS (
        SELECT DISTINCT posts.id as post_id
        FROM posts
        JOIN votes ON votes.post_id = posts.id
        -- GROUP BY not needed anymore
        -- GROUP BY posts.id
        )
SELECT COUNT(*)
FROM uniq;
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!