Pig Latin split columns to rows

Submitted by 青春壹個敷衍的年華 on 2020-01-30 12:00:26

Question


Is there a way in Pig Latin to split comma-separated column values into rows, producing the output below?

Input:

id|column1|column2
1|a,b,c|1,2,3
2|d,e,f|4,5,6

Required output:

id|column1|column2
1|a|1
1|b|2
1|c|3
2|d|4
2|e|5
2|f|6

thanks


Answer 1:


I'm willing to bet this is not the best way to do this, but here is one approach:

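-- Load the pipe-delimited input; both list columns are read as plain strings.
data = load 'input' using PigStorage('|') as (id:chararray, col1:chararray, 
       col2:chararray);
-- Explode each comma-separated column into one row per token.
A = foreach data generate id, flatten(TOKENIZE(col1));
B = foreach data generate id, flatten(TOKENIZE(col2));
-- RANK without a field prepends a sequential row number to each row.
RA = RANK A;
RB = RANK B;
-- Round-trip through storage so the ranked rows can be reloaded with positional fields.
store RA into 'ra_temp' using PigStorage(',');
store RB into 'rb_temp' using PigStorage(',');
-- The part file name may differ depending on how the job ran.
data_a = load 'ra_temp/part-m-00000' using PigStorage(',');
data_b = load 'rb_temp/part-m-00000' using PigStorage(',');
-- Join on the row number, then keep id, the col1 token, and the col2 token.
jed = JOIN data_a BY $0, data_b BY $0;
final = foreach jed generate $1, $2, $5;
dump final;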

(1,a,1)
(1,b,2)
(1,c,3)
(2,d,4)
(2,e,5)
(2,f,6)

store final into '~/some_dir' using PigStorage('|');
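
To make the idea behind the script above easier to follow, here is a minimal plain-Python sketch of the same steps: explode each column, number the resulting rows, and join on that row number. The `explode` helper and the in-memory `rows` list are illustrative assumptions, not part of the Pig script, and the pairing is only correct because both exploded lists come out in the same order.

```python
def explode(rows, col):
    """Return (id, token) pairs for one comma-separated column,
    roughly what flatten(TOKENIZE(...)) produces in the Pig script."""
    out = []
    for row in rows:
        for token in row[col].split(','):
            out.append((row[0], token))
    return out

# Hypothetical in-memory copy of the sample input from the question.
rows = [("1", "a,b,c", "1,2,3"), ("2", "d,e,f", "4,5,6")]

# Number each exploded row, mimicking RANK without a sort field.
ranked_a = {rank: pair for rank, pair in enumerate(explode(rows, 1), start=1)}
ranked_b = {rank: pair for rank, pair in enumerate(explode(rows, 2), start=1)}

# Join on the row number and keep id, column1 token, column2 token.
final = [
    (ranked_a[r][0], ranked_a[r][1], ranked_b[r][1])
    for r in sorted(ranked_a)
    if r in ranked_b
]

for record in final:
    print("|".join(record))
```

If the two columns of a row held different numbers of tokens, the row numbers would drift out of step and later rows would be paired incorrectly, which is one reason this approach is fragile.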

EDIT: I really like this question. I was discussing it with a co-worker, and he came up with a much simpler and more elegant solution. If you have Jython installed:

#  create file called udf.py

@outputSchema("innerBag:bag{innerTuple:(column1:chararray, column2:chararray)}")
def pigzip(column1, column2):
    c1 = column1.split(',')
    c2 = column2.split(',')
    innerBag = zip(c1, c2)
    return innerBag

Then, in Pig:

$ pig -x local
register udf.py using jython as udf;
data = load 'input' using PigStorage('|') as (id:chararray, column1:chararray,
       column2:chararray);
result = foreach data generate id, flatten(udf.pigzip(column1, column2));
dump result;
store result into 'output' using PigStorage('|');
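
The zip logic in the UDF can be checked outside Pig before registering it. The sketch below is plain Python under a few assumptions: the `@outputSchema` decorator is left out because it is only expected to be available inside Pig's Jython runtime, `zip` is wrapped in `list` so the result is a list on Python 3 as well, and the `rows` list is a hypothetical stand-in for the loaded relation.

```python
def pigzip(column1, column2):
    """Pair up the comma-separated tokens of two columns by position."""
    c1 = column1.split(',')
    c2 = column2.split(',')
    return list(zip(c1, c2))

# Hypothetical in-memory copy of the sample input from the question.
rows = [("1", "a,b,c", "1,2,3"), ("2", "d,e,f", "4,5,6")]

# Emulate "foreach data generate id, flatten(udf.pigzip(column1, column2))":
# one output record per (id, pair) combination.
result = [
    (row_id, a, b)
    for row_id, col1, col2 in rows
    for a, b in pigzip(col1, col2)
]

for record in result:
    print("|".join(record))
```

Note that `zip` stops at the shorter list, so a row whose two columns hold different numbers of tokens silently loses the extra ones.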


Source: https://stackoverflow.com/questions/23884651/pig-latin-split-columns-to-rows
