Pig: efficient filtering by loaded list

Posted by 旧城冷巷雨未停 on 2019-12-08 08:17:31

Question


In Apache Pig (version 0.16.x), what are some of the most efficient methods to filter a dataset by an existing list of values for one of the dataset's fields?

For example, (Updated per @inquisitive_mind's tip)

Input: a line-separated file with one value per line, my_codes.txt

'110'
'100'
'000'

sample_data.txt

'110', 2
'110', 3
'001', 3
'000', 1

Desired Output

'110', 2
'110', 3
'000', 1

Sample script

%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'
my_codes = LOAD '$my_codes_file' as (code:chararray);
sample_data = LOAD '$sample_data_file' as (code: chararray, point: float);
desired_data = FILTER sample_data BY code IN (my_codes.code);

Error:

Scalar has more than one row in the output. 1st : ('110'), 2nd :('100') 
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )

I also tried FILTER sample_data BY code IN my_codes; but the IN clause seems to require parentheses. I then tried FILTER sample_data BY code IN (my_codes); but got the error: A column needs to be projected from a relation for it to be used as a scalar


Answer 1:


The my_codes.txt file has the codes in a single row instead of a column. Since you are loading it into a single field, the codes should be listed one per line, like this:

'110'
'100'
'000'

Alternatively, you can use JOIN. An inner join on code keeps only the rows of sample_data whose code also appears in my_codes:

joined_data = JOIN sample_data BY code, my_codes BY code;
desired_data = FOREACH joined_data GENERATE $0, $1;
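
Since the question asks about efficiency, one option worth trying is a fragment-replicate join. This is a minimal sketch, assuming the list of codes is small enough to fit in memory on each mapper and that both files load with the schemas shown in the question. The DISTINCT step is an added assumption: it guards against duplicate codes in my_codes.txt, which would otherwise duplicate matching rows in the output.

```pig
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'

-- Large relation: listed first in the JOIN, so it is streamed through the mappers.
sample_data = LOAD '$sample_data_file' AS (code:chararray, point:float);

-- Small relation: listed last, so 'replicated' loads it into memory.
my_codes_raw = LOAD '$my_codes_file' AS (code:chararray);
-- Assumption: remove duplicate codes so the join cannot multiply rows.
my_codes = DISTINCT my_codes_raw;

-- Map-side inner join; only rows whose code appears in my_codes survive.
joined_data = JOIN sample_data BY code, my_codes BY code USING 'replicated';

-- Project the original columns by name rather than by position.
desired_data = FOREACH joined_data GENERATE
    sample_data::code AS code,
    sample_data::point AS point;
```

With USING 'replicated', the relation listed last is the one held in memory, so the join can run map-side without a reduce phase. If the code list is too large for that, the plain JOIN above remains the fallback.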


Source: https://stackoverflow.com/questions/44532143/pig-efficient-filtering-by-loaded-list
