Question
How do I read in a list of bags in Pig?
I tried:
grunt> cat sample.txt
{a,b},{},{c,d}
grunt> data = LOAD 'sample.txt' AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data;
({},,)
Answer 1:
The default method for reading data into Pig is PigStorage('\t'); that is, it assumes your data is tab-separated. Yours is comma-separated, so you should write LOAD 'sample.txt' USING PigStorage(',') AS...
However, your data is not in proper Pig bag format. Remember that a bag is a collection of tuples, so each element has to be wrapped in parentheses. If you cannot pre-process your input, you'll have to write a UDF to parse input of the form you have given. With the elements wrapped in tuples, you might expect this to work:
grunt> cat tmp/data.txt
{(a),(b)},{},{(c),(d)}
grunt> data = LOAD 'tmp/data.txt' USING PigStorage(',') AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data;
(,,{})
What went wrong? The fact that your input field separator (,) is the same as the bag-record separator is confusing Pig. It parses your input into the fields {(a), (b)}, and {}, which is why only the third field ends up being a bag. It's also why you'll see a warning message like Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s).
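To make the splitting problem concrete, here is a small Python illustration of a naive split on commas. It only mimics the behaviour described above and is not Pig's actual parsing code; the helper looks_like_bag is a hypothetical name introduced for this sketch.

```python
# Illustration only: a naive split on the field delimiter, in the spirit of
# what the answer describes. This is NOT Pig's real parser.
line = "{(a),(b)},{},{(c),(d)}"

pieces = line.split(",")
# The schema declares three fields, so only the first three pieces are used.
fields = pieces[:3]


def looks_like_bag(text):
    """Hypothetical check: a bag literal starts with '{' and ends with '}'."""
    return text.startswith("{") and text.endswith("}")


for field in fields:
    print(repr(field), "bag" if looks_like_bag(field) else "not a bag")
```

Only the third piece is a complete bag literal, which lines up with the (,,{}) result and the two discarded fields in the warning.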
If you can, use tabs or spaces (or semicolons, or...) instead of commas as the field separator, passing any non-tab delimiter to PigStorage. With tabs, the default loader works:
grunt> cat tmp/data.txt
{(a),(b)} {} {(c),(d)}
grunt> data = LOAD 'tmp/data.txt' AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data;
({(a),(b)},{},{(c),(d)})
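If pre-processing is an option, the conversion from the original file format to tab-separated bags of tuples could look roughly like the following Python sketch. The function name to_pig_bags is made up for this example, and it assumes bags are never nested and that elements contain no commas or braces.

```python
import re


def to_pig_bags(line):
    """Convert '{a,b},{},{c,d}' into tab-separated Pig bag literals.

    Assumption: bags are not nested and elements contain no commas or braces.
    """
    # Grab the text between each pair of braces, e.g. 'a,b', '', 'c,d'.
    bodies = re.findall(r"\{([^{}]*)\}", line)
    converted = []
    for body in bodies:
        items = [item.strip() for item in body.split(",") if item.strip()]
        # Wrap every element in parentheses so it becomes a one-field tuple.
        converted.append("{" + ",".join("(" + item + ")" for item in items) + "}")
    # Tabs are the default PigStorage delimiter and do not clash with commas.
    return "\t".join(converted)


print(to_pig_bags("{a,b},{},{c,d}"))
```

Running each line of the input file through this function and writing the result out should give a file that the plain LOAD ... AS (a:bag{}, b:bag{}, c:bag{}) statement can read.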
Source: https://stackoverflow.com/questions/15160400/how-do-i-read-in-a-list-of-bags-in-pig