Calculating percentage in a pig query

Deadly, posted on 2019-12-13 02:13:47

Question


  • I have a table with two columns (col1:string, col2:boolean)
  • Let's say col1 = "aaa"
  • For col1 = "aaa", there are many True/False values of col2
  • I want to calculate the percentage of True values for col1 (aaa)

INPUT:

aaa T
aaa F
aaa F
bbb T
bbb T
ccc F
ccc F

OUTPUT

COL1   TOTAL_ROWS_IN_INPUT_TABLE   PERCENTAGE_TRUE_IN_INPUT_TABLE
aaa     3                          33%
bbb     2                          100%
ccc     2                          0%

How would I do this using PIG (LATIN)?


Answer 1:


In Pig 0.10, SUM(INPUT.col2) does not work, and casting to boolean is not possible because Pig treats INPUT.col2 as a bag of booleans, and a bag is not a primitive type. Also, if col2 is declared as boolean in the load schema, a dump of the input shows no values for col2, whereas declaring it as chararray works fine.

Pig is well suited to this type of task because operators can be nested inside a FOREACH to work on each group individually. Here is a solution that works:

inpt = load '....' as (col1 : chararray, col2 : chararray);
grp = group inpt by col1; -- creates bags for each value in col1
result = foreach grp {
    total = COUNT(inpt);
    t = filter inpt by col2 == 'T'; --create a bag which contains only T values
    generate flatten(group) as col1, total as  TOTAL_ROWS_IN_INPUT_TABLE, 100*(double)COUNT(t)/(double)total as PERCENTAGE_TRUE_IN_INPUT_TABLE;
};

dump result;

Output:

(aaa,3,33.333333333333336)
(bbb,2,100.0)
(ccc,2,0.0)



Answer 2:


When you COUNT the number of records for each key in col1, you should count the number of true values at the same time. This way the entire thing takes place in one MapReduce job.

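-- inpt is the loaded relation (INPUT in the question); input and output are reserved
-- keywords in Pig, so the aliases are renamed. col2 must be numeric (1 = true, 0 = false).
grouped = group inpt by col1;
result = foreach grouped generate group, COUNT(inpt), (double)SUM(inpt.col2)/COUNT(inpt);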

I am stuck with Pig 0.9 on a legacy system, so I am not familiar with the new boolean type. If it is possible to SUM over booleans, then that should be sufficient. Otherwise, you will need to translate the booleans into 1s and 0s with a simple foreach/generate first.
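
The translation step mentioned above could look roughly like this. It is only a minimal sketch, assuming col2 is loaded as a chararray holding 'T' or 'F' as in the first answer. The aliases flags, is_true, total, and pct_true are made up for illustration, and the load path is left elided as in the original.

```pig
-- Assumption: col2 is loaded as chararray ('T' / 'F'), as in the first answer.
inpt = load '....' as (col1:chararray, col2:chararray);

-- Map each row to a numeric flag: 1 for 'T', 0 for any other non-null value.
flags = foreach inpt generate col1, (col2 == 'T' ? 1 : 0) as is_true;

-- Count the rows and sum the flags per key in the same grouped pass.
grouped = group flags by col1;
result = foreach grouped generate
    group as col1,
    COUNT(flags) as total,
    100 * (double)SUM(flags.is_true) / (double)COUNT(flags) as pct_true;

dump result;
```

The multiplication by 100 turns the fraction into a percentage, which is what the question asks for. The explicit casts to double mirror the first answer and avoid integer division.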



Source: https://stackoverflow.com/questions/13476642/calculating-percentage-in-a-pig-query
