How to use REGEX_EXTRACT_ALL in Pig

Question

This is my sample data:

subId=00001111911128052627,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212218.4621702216543667E17
subId=00001111911128052639,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212219.6726312167218586E17
subId=00001111911128052615,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212216.9431647633139046E17

My expected output will be a tuple where each field represents a matched group:

(capturing_group1, capturing_group2, ..., capturing_groupN)

e.g. (00001111911128052627,11232w34532543456345623453456984756894756,122112212212212216.9431647633139046E17)

This is my approach:

A = load '/home/hduser/Desktop/arrtest1.txt' using TextLoader as (line:chararray);
b = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[subId=](.*)[towerid=](.*)[bytes=](.*)')) AS (F1,F2,F3);

But I am not getting the expected result.

Answer 1:

Based on your input example, you can try this regex:

REGEX_EXTRACT_ALL(line,'subId=([^,]*),towerid=([^,]*),bytes=(.*)')
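To try the pattern outside Pig, here is a minimal Java sketch. It assumes that REGEX_EXTRACT_ALL uses java.util.regex and, by default, requires the pattern to match the whole line; the class ExtractAllSketch and the helper extractAll are hypothetical names for illustration, not part of Pig. The last two lines also show why the bracketed pieces in the question's pattern do not behave like literal text: [subId=] is a character class that matches one character from the set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractAllSketch {
    // Hypothetical helper (not a Pig API): returns all capture groups
    // when the pattern matches the WHOLE line, otherwise null.
    static String[] extractAll(String line, String regex) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1); // group 0 is the whole match
        }
        return groups;
    }

    public static void main(String[] args) {
        // First line of the sample data from the question.
        String line = "subId=00001111911128052627,"
                + "towerid=11232w34532543456345623453456984756894756,"
                + "bytes=122112212212212218.4621702216543667E17";
        String[] g = extractAll(line, "subId=([^,]*),towerid=([^,]*),bytes=(.*)");
        System.out.println(String.join(" | ", g));

        // Square brackets form a character class: ONE character from the set.
        System.out.println("s".matches("[subId=]"));      // true
        System.out.println("subId=".matches("[subId=]")); // false
    }
}
```

With three capture groups in the pattern, the AS clause after FLATTEN should also declare three fields.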

You can check the behaviour of this regex at this link.

Update: why not use .* to check the field?

The Kleene star * is greedy by default, so .* first consumes everything up to the end of the string. The regex engine then backtracks one character at a time, checking at each position whether the next part of the regex matches (e.g. it looks for a comma , after the first .*).

So in the end all the regexes below match, but they take a different number of steps to complete:
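The greedy behaviour is easy to observe on a small input. The following Java sketch (GreedySketch and firstGroup are hypothetical names used only for illustration) compares a greedy .* with a negated character class on the string a,b,c:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedySketch {
    // Hypothetical helper: first capture group of the first match, or null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a,b,c";
        // Greedy .* grabs the whole string, then gives characters back
        // until a comma can follow: it ends up at the LAST comma.
        System.out.println(firstGroup("(.*),", input));    // a,b
        // [^,]* cannot cross a comma, so it stops at the FIRST comma
        // without any backtracking.
        System.out.println(firstGroup("([^,]*),", input)); // a
    }
}
```

On the sample data both styles end up with the same groups; the negated class simply gets there with less backtracking.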

[a-zA-Z]+=(.*),[a-zA-Z]+=(.*),[a-zA-Z]+=(.*) - 1142 steps

subId=([^,]*),towerid=([^,]*),bytes=(.*) - 96 steps.

If you don't care about the field names and only need them to consist of letters (uppercase or lowercase):

(?i)[a-z]+=([^,]*)[a-z,]+=([^,]*),[a-z,]+=(.*) - 58 steps

NB: the Apache Pig regex engine is based on the Java one, so the case-insensitive flag (?i) is likely to work too.

Source: https://stackoverflow.com/questions/34492105/how-to-use-regex-extract-all-in-pig

Tags
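The effect of the flag can be checked in plain Java. This is a sketch under the assumption that Pig passes the pattern unchanged to java.util.regex and matches the whole line; CaseFlagSketch and groups are hypothetical names. Without (?i), [a-z]+ stops at the uppercase I in subId and the whole-line match fails.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseFlagSketch {
    // Hypothetical helper: capture groups of a whole-line match, or null.
    static String[] groups(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // First line of the sample data from the question.
        String line = "subId=00001111911128052627,"
                + "towerid=11232w34532543456345623453456984756894756,"
                + "bytes=122112212212212218.4621702216543667E17";
        String generic = "[a-z]+=([^,]*)[a-z,]+=([^,]*),[a-z,]+=(.*)";

        // With the flag, "subId" is accepted despite the uppercase I.
        System.out.println(Arrays.toString(groups("(?i)" + generic, line)));
        // Without the flag, the whole-line match fails.
        System.out.println(groups(generic, line) == null); // true
    }
}
```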

regex

apache-pig