问题
The following pig latin script:
data = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray, dash2:chararray, date:chararray, date1:chararray, getRequset:chararray, location:chararray, http:chararray, code:int, size:int);
splitDate = foreach data generate size as size:int , ip as ip, FLATTEN(STRSPLIT(date, ':')) as h;
groupedIp = group splitDate by h.$1;
a = foreach groupedIp{
added = foreach splitDate generate SUM(size); --
generate added;
};
describe a;
gives me the error:
ERROR 1045:
<file 3.pig, line 10, column 39> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
This error makes me think I need to cast size as an int, but if i describe my groupedIp
field, I get the following schema.
groupedIp: {group: bytearray,splitDate: {(size: int,ip: chararray,h: bytearray)}}
which indicates that size is an int, and should be able to be used by the sum function.
Am I calling the sum function incorrectly? Let me know if you would like to see any thing else, such as the input file.
回答1:
SUM operates on a bag as input, but you pass it the field 'size
'.
Try to eliminate the nested foreach and use:
a = foreach groupedIp generate SUM(splitDate.size);
回答2:
Do some dumps of your data. I'll bet some of the stuff in the size
column is non-integer, and Pig runs into that and dies. You could also code up your own isInteger udf to check this before the rest of your processing, and throw out any that aren't integers.
回答3:
SUM
, AVG
and COUNT
are functions that always work on a bag, therefore group the data and then join with the original set like below:
A = load 'nyse_data.txt' as (exchange:chararray, symbol:chararray,date:chararray, pen:float,high:float, low:float, close:float,volume:int, adj_close:float);
G = group A by symbol;
C = foreach G generate group, SUM(A.open);
来源:https://stackoverflow.com/questions/16405267/error-1045-on-sum-function-in-pig-latin-with-an-int