Count rows in partition with Order By

你离开我真会死。 提交于 2019-12-11 03:33:48

问题


I was trying to understand PARTITION BY in postgres by writing a few sample queries. I have a test table on which I run my query.

id integer | num integer
___________|_____________
1          | 4 
2          | 4
3          | 5
4          | 6

When I run the following query, I get the output as I expected.

SELECT id, COUNT(id) OVER(PARTITION BY num) from test;

id         | count
___________|_____________
1          | 2 
2          | 2
3          | 1
4          | 1

But, when I add ORDER BY to the partition,

SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;

id         | count
___________|_____________
1          | 1 
2          | 2
3          | 1
4          | 1

My understanding is that COUNT is computed across all rows that fall into a partition. Here, I have partitioned the rows by num. The number of rows in the partition is the same, with or without an ORDER BY clause. Why is there a difference in the outputs?


回答1:


When you add an order by to an aggregate used as a window function that aggregate turns into a "running count" (or whatever aggregate you use).

The count(*) will return the number of rows up until the "current one" based on the order specified.

The following query shows the different results for aggregates used with an order by. With sum() instead of count() it's a bit easier to see (in my opinion).

with test (id, num, x) as (
  values 
    (1, 4, 1),
    (2, 4, 1),
    (3, 5, 2),
    (4, 6, 2)
)
select id, 
       num,
       x,
       count(*) over () as total_rows, 
       count(*) over (order by id) as rows_upto,
       count(*) over (partition by x order by id) as rows_per_x,
       sum(num) over (partition by x) as total_for_x,
       sum(num) over (order by id) as sum_upto,
       sum(num) over (partition by x order by id) as sum_for_x_upto
from test;

will result in:

id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
 1 |   4 | 1 |          4 |         1 |          1 |           8 |        4 |              4
 2 |   4 | 1 |          4 |         2 |          2 |           8 |        8 |              8
 3 |   5 | 2 |          4 |         3 |          1 |          11 |       13 |              5
 4 |   6 | 2 |          4 |         4 |          2 |          11 |       19 |             11

There are more examples in the Postgres manual




回答2:


Your two expressions are:

COUNT(id) OVER (PARTITION BY num)

COUNT(id) OVER (PARTITION BY num ORDER BY id)

Why would you expect these to return the same values? The syntax is different for a reason.

The first returns the overall count for each num -- essentially joining back the aggregated value.

The second does a cumulative count. It does the COUNT() for each row of id, for all values up to that ids value.

Note that such cumulative counts would normally be implemented using RANK() (or related functions). The cumulative count is subtly different from RANK(). The cumulative count implements:

COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

RANK() is slightly different. The difference only matters when the ORDER BY keys have ties.



来源:https://stackoverflow.com/questions/50945477/count-rows-in-partition-with-order-by

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!