How do I read in a list of bags in Pig?

拥有回忆 提交于 2019-12-23 19:06:14

问题


How do I read in a list of bags in Pig?

I tried:

grunt> cat sample.txt
{a,b},{},{c,d}
grunt> data = LOAD 'sample.txt' AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data
({},,)

回答1:


The default method for reading data into Pig is PigStorage('\t') -- that is, it assumes your data is tab-separated. Yours is comma-separated. So you should write LOAD 'sample.txt' USING PigStorage(',') AS....

However, your data is not in proper Pig bag format. Remember that a bag is a collection of tuples. If you cannot pre-process your input, you'll have to write a UDF to parse input of the form you have given. So this ought to work:

grunt> cat tmp/data.txt
{(a),(b)},{},{(c),(d)}
grunt> data = LOAD 'tmp/data.txt' USING PigStorage(',') AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data;
(,,{})

What went wrong? The fact that your input field separator (,) is the same as the bag-record separator is confusing Pig. It parses your input into the fields {(a), (b)}, and {}, which is why only the third field ends up being a bag. It's why you'll see a warning message like Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s).

If you can, try to use tabs or spaces (or semicolons, or...) instead of commas:

grunt> cat tmp/data.txt                                                                
{(a),(b)}       {}      {(c),(d)}
grunt> data = LOAD 'tmp/data.txt' AS (a:bag{}, b:bag{}, c:bag{});                      
grunt> DUMP data;
({(a),(b)},{},{(c),(d)})


来源:https://stackoverflow.com/questions/15160400/how-do-i-read-in-a-list-of-bags-in-pig

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!