Question
I have the following file:
File
----
12-3 John 121
5-1 Sam 122
The file is tab (\t) delimited. I am loading each row as line:chararray because I do not want the data split into individual fields.
Now I want to extract and store the leading values (12-3 and 5-1) as separate fields.
I am trying STRSPLIT and REGEX_EXTRACT_ALL, but the data doesn't seem to match.
splitdata = FOREACH filedata {
regex = REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*)');
split = STRSPLIT(line, '\\t', 1);
GENERATE regex, split;
};
This is how I want my final data to be:
(12, 3, 12-3 John 121)
( 5, 1, 5-1 Sam 122)
Answer 1:
What about:
A = LOAD .... AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^(.*)\t(.*)\t(.*)$'))
AS (id:chararray, name:chararray, nameid:chararray);
C = FOREACH B GENERATE FLATTEN(REGEX_EXTRACT_ALL(id, '^([0-9]*)\\-([0-9]*)')),
id, name, nameid;
STORE C INTO ...
If you split the lines into fields on \t when loading, you can skip the B = ... step.
Answer 2:
Thanks Lorand.
Since you gave me an idea of how to use REGEX_EXTRACT_ALL, here is how I finally used it.
FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*).*'))
AS (FIELD1:chararray, FIELD2:chararray), line;
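Interestingly, Matcher.matches() fails for '^([0-9]*)\\-([0-9]*)' but works for '^([0-9]*)\\-([0-9]*).*'. The reason is that matches() succeeds only when the pattern matches the entire input, so the pattern needs the trailing .* to consume the rest of the line.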
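The difference can be reproduced in plain Java. This is only a minimal sketch: it assumes, as this answer suggests, that REGEX_EXTRACT_ALL relies on Matcher.matches(), and the class MatchesDemo and helper fullMatchGroups are hypothetical names made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Pig.
class MatchesDemo {
    // Returns the capture groups if the regex matches the ENTIRE input,
    // or null if Matcher.matches() fails.
    static String[] fullMatchGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1); // group 0 is the whole match, so skip it
        }
        return groups;
    }

    public static void main(String[] args) {
        String line = "12-3\tJohn\t121";
        // No trailing .*: the pattern covers only "12-3", not the whole line.
        System.out.println(Arrays.toString(
            fullMatchGroups("^([0-9]*)\\-([0-9]*)", line)));
        // Trailing .* consumes the rest of the line, so matches() succeeds.
        System.out.println(Arrays.toString(
            fullMatchGroups("^([0-9]*)\\-([0-9]*).*", line)));
    }
}
```

The first call returns null because the pattern stops after "12-3", while the second returns the two captured groups. This also explains why the shorter pattern works in Answer 1: there it is applied to the id field alone, which it matches completely.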
Source: https://stackoverflow.com/questions/13396778/strsplit-and-regex-extract-all-in-piglatin