Question
I have two files. The first is called a-records:
123^record1
222^record2
333^record3
The second is called b-records:
123^jim
123^jim
222^mike
333^joe
As you can see, the token 123 appears once in file A but twice in file B. Is there a way, using Apache Pig, to join the data so that I get only ONE joined record for each record in the A file?
Here is my current script:
arecords = LOAD '$a' USING PigStorage('^') as (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') as (token:chararray, name:chararray);
x = JOIN arecords BY token, brecords BY token;
dump x;
It yields:
(123,record1,123,jim)
(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)
What I REALLY want is this (notice that token 123 appears only once after the join):
(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)
Any ideas? Thanks so much.
Answer 1:
I would remove the duplicate rows from brecords with DISTINCT before joining, something like this:
arecords = LOAD '$a' USING PigStorage('^') as (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') as (token:chararray, name:chararray);
bdistinct = DISTINCT brecords;
x = JOIN arecords BY token, bdistinct BY token;
dump x;
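Note that DISTINCT only removes rows that are identical in every field. If b-records could contain the same token with different names (say 123^jim and 123^bob, which is not in the sample data above), the join would still return two rows for 123. A minimal sketch of one way to keep a single row per token in that case, assuming a Pig version that supports LIMIT inside a nested FOREACH, and assuming it does not matter which of the rows is kept (the aliases bgrouped, bone and first are just illustrative names):

```pig
arecords = LOAD '$a' USING PigStorage('^') AS (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') AS (token:chararray, name:chararray);

-- Collect all b rows that share a token into one bag per token.
bgrouped = GROUP brecords BY token;

-- Keep one (arbitrary) row from each bag and flatten it back to plain fields.
bone = FOREACH bgrouped {
    first = LIMIT brecords 1;
    GENERATE FLATTEN(first) AS (token, name);
};

-- bone now has at most one row per token, so each a row matches at most once.
x = JOIN arecords BY token, bone BY token;
dump x;
```

For the sample files in the question this should give the same three joined rows as the DISTINCT version. If a specific row has to win when names differ, an ORDER inside the nested block before the LIMIT would be one way to control that.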
Source: https://stackoverflow.com/questions/7790079/how-can-i-do-this-inner-join-properly-in-apache-pig