Question
In Apache Pig (version 0.16.x), what are some of the most efficient methods to filter a dataset by an existing list of values for one of the dataset's fields?
For example (updated per @inquisitive_mind's tip):
Input: my_codes.txt, a line-separated file with one value per line
'110'
'100'
'000'
sample_data.txt
'110', 2
'110', 3
'001', 3
'000', 1
Desired Output
'110', 2
'110', 3
'000', 1
Sample script
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'
my_codes = LOAD '$my_codes_file' as (code:chararray);
sample_data = LOAD '$sample_data_file' as (code: chararray, point: float);
desired_data = FILTER sample_data BY code IN (my_codes.code);
Error:
Scalar has more than one row in the output. 1st : ('110'), 2nd :('100')
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
I had also tried FILTER sample_data BY code IN my_codes;
but the "IN" clause seems to require parentheses.
I also tried FILTER sample_data BY code IN (my_codes);
but got the error:
A column needs to be projected from a relation for it to be used as a scalar
Answer 1:
The my_codes.txt file has the codes in a single row instead of a column. Since you are loading it into a single field, the codes should be laid out one per line, like this:
'110'
'100'
'000'
Alternatively, you can use a JOIN:
joined_data = JOIN sample_data BY code, my_codes BY code;
desired_data = FOREACH joined_data GENERATE $0,$1;
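Since the question asks about efficiency, the JOIN approach can be sketched as a fragment-replicate join. This is a sketch under assumptions, not a tested solution: it assumes the list of codes is small enough to fit in memory on each map task, and it reuses the loaders from the question unchanged (add USING PigStorage(',') if the files are actually comma-delimited). The alias unique_codes is a name introduced here for illustration.

```pig
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'

-- Same loaders as in the question (default delimiter is tab).
my_codes = LOAD '$my_codes_file' AS (code:chararray);
sample_data = LOAD '$sample_data_file' AS (code:chararray, point:float);

-- Drop duplicate codes so the join cannot multiply rows of sample_data.
unique_codes = DISTINCT my_codes;

-- Replicated join: the large relation is listed first; the small relation
-- is listed last and is held in memory, so no reduce phase is needed.
joined_data = JOIN sample_data BY code, unique_codes BY code USING 'replicated';

-- Keep only the original sample_data columns, disambiguated with '::'.
desired_data = FOREACH joined_data GENERATE sample_data::code AS code, sample_data::point AS point;

DUMP desired_data;
```

If the code list is too large to hold in memory, dropping USING 'replicated' falls back to the default join shown above. For a short, fixed list, a literal filter such as FILTER sample_data BY code IN ('110', '100', '000'); also works, because IN expects a parenthesized list of expressions rather than a relation.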
Source: https://stackoverflow.com/questions/44532143/pig-efficient-filtering-by-loaded-list