STRSPLIT and REGEX_EXTRACT_ALL in PigLatin

Posted by 假装没事ソ on 2019-12-12 19:22:58

Question


I have the following file:

File
----
12-3    John    121
 5-1    Sam     122

The file is tab (\t) delimited. I load each row as line:chararray because I don't want the data split into individual fields.

Now I want to extract the details (12-3 and 5-1) and store them as separate fields.

I am trying STRSPLIT and REGEX_EXTRACT_ALL, but the data doesn't seem to match.

splitdata = FOREACH filedata {
    regex = REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*)');
    split = STRSPLIT(line, '\\t', 1);
    GENERATE regex, split;
};

This is how I want my final data to be:

(12, 3, 12-3    John    121)
( 5, 1,  5-1    Sam     122)

Answer 1:


What about:

A = LOAD .... AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^(.*)\t(.*)\t(.*)$')) 
      AS (id:chararray, name:chararray, nameid:chararray);
C = FOREACH B GENERATE FLATTEN(REGEX_EXTRACT_ALL(id, '^([0-9]*)\\-([0-9]*)')), 
      id, name, nameid;
STORE C INTO ...

If you split the lines into fields on \t when loading, you could skip the B = ... step.




Answer 2:


Thanks, Lorand.

Your answer showed me how to use REGEX_EXTRACT_ALL, so here is how I finally used it.

FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*).*')) 
  AS (FIELD1:chararray, FIELD2:chararray), line;

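Interestingly, Matcher.matches() fails for '^([0-9]*)\\-([0-9]*)' but succeeds for '^([0-9]*)\\-([0-9]*).*'. That is because matches() must match the entire input, not just a prefix, so the trailing .* is needed to consume the rest of the line.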
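The difference can be reproduced outside Pig with plain java.util.regex. The sketch below assumes, as this answer implies, that REGEX_EXTRACT_ALL relies on Matcher.matches(); the class MatchesDemo and the helper extractAll are hypothetical names used only for illustration and are not part of Pig.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it only mimics the whole-input matching behaviour.
class MatchesDemo {

    // Returns the capture groups if the regex matches the ENTIRE input,
    // or null if it does not (matches() never accepts a partial match).
    static String[] extractAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1); // group 0 is the whole match, so skip it
        }
        return groups;
    }

    public static void main(String[] args) {
        String line = "12-3\tJohn\t121"; // first sample row, tab delimited

        // Pattern covers only the "12-3" prefix, so the full-line match fails.
        System.out.println(Arrays.toString(
                extractAll("^([0-9]*)\\-([0-9]*)", line)));   // prints: null

        // Trailing .* consumes the rest of the line, so the match succeeds.
        System.out.println(Arrays.toString(
                extractAll("^([0-9]*)\\-([0-9]*).*", line))); // prints: [12, 3]
    }
}
```

If only a prefix match is wanted in plain Java, Matcher.lookingAt() or Matcher.find() would accept the shorter pattern. Inside Pig, adding .* to the pattern as shown above is the simpler fix.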



Source: https://stackoverflow.com/questions/13396778/strsplit-and-regex-extract-all-in-piglatin
