Hive: Sum over a specified group (HiveQL)

Submitted by 五迷三道 on 2020-12-28 07:45:40

Question


I have a table:

key    product_code    cost
1      UK              20
1      US              10
1      EU              5
2      UK              3
2      EU              6

I would like to find the sum of all products for each group of "key" and append to each row. For example for key = 1, find the sum of costs of all products (20+10+5=35) and then append result to all rows which correspond to the key = 1. So end result:

key    product_code    cost     total_costs
1      UK              20       35
1      US              10       35
1      EU              5        35
2      UK              3        9
2      EU              6        9

I would prefer to do this without using a sub-join, as this would be inefficient. My best idea would be to use the OVER clause in conjunction with the SUM function, but I can't get it to work. My best try:

SELECT key, product_code, sum(costs) over(PARTITION BY key)
FROM test
GROUP BY key, product_code;

I've had a look at the docs, but they're so cryptic that I have no idea how to do it. I'm using Hive v0.12.0, HDP v2.0.6, the Hortonworks Hadoop distribution.


Answer 1:


Similar to @VB_'s answer, use the ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame clause, so that the window covers every row of the partition.

The HiveQL query is therefore:

SELECT key, product_code, cost,
SUM(cost) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS total_costs
FROM test;



Answer 2:


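You could use a window frame to achieve that without a self join. With BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, the sum is a running total within each partition; replace CURRENT ROW with UNBOUNDED FOLLOWING to get the same group total on every row.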

Code as below:

SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM T;
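
To make the difference between the two frames concrete, here is a minimal sketch against the question's table. It assumes the table is `test(key, product_code, cost)` as shown in the question; the `running_cost` alias and the choice to order by `product_code` are illustrative only, and the query has not been run on Hive 0.12.

```sql
-- Sketch only: assumes the question's table test(key, product_code, cost).
SELECT key, product_code, cost,
       -- Frame ends at the current row: running total within each key.
       SUM(cost) OVER (PARTITION BY key ORDER BY product_code
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_cost,
       -- Frame spans the whole partition: the same total on every row of the key.
       SUM(cost) OVER (PARTITION BY key ORDER BY product_code
                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS total_costs
FROM test;
```

For key = 1, ordering by product_code visits EU (5), UK (20), US (10), so running_cost would be 5, 25, 35 while total_costs stays at 35 on all three rows.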



Answer 3:


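The analytic function sum gives cumulative sums when the window has an ORDER BY clause, because the default frame then ends at the current row. For example, if you did:

select key, product_code, cost, sum(cost) over (partition by key order by cost desc) as total_costs from test

then you would get:

key    product_code    cost     total_costs
1      UK              20       20
1      US              10       30
1      EU              5        35
2      EU              6        6
2      UK              3        9

which, it seems, is not what you want. Without the ORDER BY, the default frame is the whole partition and every row gets the group total.

If you would rather avoid window functions, you can use the aggregate function sum, combined with a self join, to accomplish this: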

select test.key, test.product_code, test.cost, agg.total_cost
from (
  select key, sum(cost) as total_cost
  from test
  group by key
) agg
join test
on agg.key = test.key;



Answer 4:


This query gives me the perfect result:

select key, product_code, cost, sum(cost) over (partition by key) as total_costs from zone;




Answer 5:


A similar answer, using Oracle's emp sample table:

select deptno, ename, sal, sum(sal) over(partition by deptno) from emp;

The output will look like this:

deptno    ename     sal     sum_window_0
10        MILLER    1300    8750
10        KING      5000    8750
10        CLARK     2450    8750
20        SCOTT     3000    10875
20        FORD      3000    10875
20        ADAMS     1100    10875
20        JONES     2975    10875
20        SMITH     800     10875
30        BLAKE     2850    9400
30        MARTIN    1250    9400
30        ALLEN     1600    9400
30        WARD      1250    9400
30        TURNER    1500    9400
30        JAMES     950     9400



Answer 6:


The table above looked like

key    product_code    cost
1      UK              20
1      US              10
1      EU              5
2      UK              3
2      EU              6

The user wanted a table with the total costs, like the following:

key    product_code    cost     total_costs
1      UK              20       35
1      US              10       35
1      EU              5        35
2      UK              3        9
2      EU              6        9

Therefore we used the following query:

SELECT key, product_code, cost,
SUM(cost) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS total_costs
FROM test;

So far, so good. I want one more column, counting the occurrences of each country:

key    product_code    cost     total_costs     occurences
1      UK              20       35              2
1      US              10       35              1
1      EU              5        35              2
2      UK              3        9               2
2      EU              6        9               2

Therefore I used the following query:

SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as total_costs
COUNT(product code) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as occurences
FROM test;

Sadly, this is not working; I get a cryptic error. To rule out a mistake in my query, I want to ask whether I did something wrong. Thanks.
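
Without the actual error message the cause can only be guessed at, but the query as posted has a few visible problems: there is no comma between the two window expressions, `product code` is written with a space instead of an underscore, and the column is called `cost` (not `costs`) in the table shown. Separately, partitioning the count by `key` would count rows per key rather than per country. A possible corrected version, offered as an untested sketch that assumes the table is `test(key, product_code, cost)`:

```sql
-- Sketch only: assumes the table test(key, product_code, cost) shown above.
SELECT key, product_code, cost,
       -- Total cost per key; with no ORDER BY the window covers the whole partition.
       SUM(cost) OVER (PARTITION BY key) AS total_costs,
       -- Rows per country: partition by product_code, not by key.
       COUNT(product_code) OVER (PARTITION BY product_code) AS occurences
FROM test;
```

Partitioning the count by product_code is what yields 2 for UK, 1 for US and 2 for EU on the sample data; partitioning by key would instead give 3 for every row of key 1 and 2 for every row of key 2.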



Source: https://stackoverflow.com/questions/25082057/hive-sum-over-a-specified-group-hiveql
