Apache Pig not parsing a tuple fully

Submitted by 不打扰是莪最后的温柔 on 2019-12-23 01:45:10

Question


I have a file called data that looks like this (note that there are tabs after 'personA'):

personA (1, 2, 3)
personB (2, 1, 34)

And I have an Apache Pig script like this:

A = LOAD 'data' AS (name: chararray, nodes: tuple(a:int, b:int, c:int));
C = foreach A generate nodes.$0;
dump C;

Its output makes sense:

(1)
(2)

However, if I change the schema in the script to this:

A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;

Then the output I get is this:

(1, 2, 3)
(2, 1, 34)

It looks like the first (and only) element in this tuple is a bytearray, i.e. Pig is not parsing the input text 1, 2, 3 into a tuple.

In the future, my input will have an unknown and variable number of elements in the nodes item, so I can't just write out a:int, ….

Is there any way to get Pig to parse the input tuple as a tuple without having to write out the full schema?


Answer 1:


Pig does not accept what you are passing in as valid. The default loader, PigStorage, only accepts delimited files (tab-delimited by default). It is not smart enough to parse the tuple construct, with the parentheses and commas, that you have in the text. Your options are:

  • Reformat your file to be tab-delimited: personA 1 2 3
  • Read the file in line by line with TextLoader, then write a UDF that parses each line and returns the data in the form you want.
  • Write your own custom loader.
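
As a rough illustration of the second option, the parsing step could be written as a small Python function, which Pig can call as a Jython UDF. This is only a sketch under assumptions: the names parse_nodes and parse_line are hypothetical, the values are assumed to be integers, the name is assumed to be separated from the tuple by one or more tabs, and the Pig-side registration and output schema declaration are not shown.

```python
def parse_nodes(text):
    """Turn a string such as '(1, 2, 3)' into a tuple of ints of any length."""
    if text is None:
        return None
    inner = text.strip()
    # Drop one pair of surrounding parentheses, if present.
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    # An empty body means an empty tuple.
    if not inner.strip():
        return ()
    # int() tolerates the whitespace left around each comma-separated item.
    return tuple(int(part) for part in inner.split(","))


def parse_line(line):
    """Split a whole input line (as TextLoader would return it) into (name, nodes)."""
    # Everything before the first tab is the name; the rest holds the tuple text.
    name, _, rest = line.partition("\t")
    return (name, parse_nodes(rest))
```

Because the function splits on commas instead of relying on a fixed schema, it handles a variable number of elements, which is the case the question is concerned with.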



Answer 2:


This is no longer a limitation. Pig now parses tuples in the input file, treating the comma as the field separator. I tested this with Apache Pig version 0.15.0.

A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;

The output I get is:

(1)
(2)



Answer 3:


Here is another way of tackling this issue, although I know the answers above are more efficient.

data = LOAD 'data' USING PigStorage() AS (name:chararray, field2:chararray);

data = FOREACH data GENERATE name, REPLACE(REPLACE(field2, '\\(', ''), '\\)', '') AS field2;

data = FOREACH data GENERATE name, STRSPLIT(field2, '\\,') AS fieldTuple;

data = FOREACH data GENERATE name, fieldTuple.$0, fieldTuple.$1, fieldTuple.$2;

  1. Load field2 as chararray
  2. Remove parentheses
  3. Split field2 by comma (it gives you a tuple with 3 fields in it)
  4. Get values by index

I know it is hacky; I just wanted to provide another way of doing this.



Source: https://stackoverflow.com/questions/8838408/apache-pig-not-parsing-a-tuple-fully
