How to use REGEX_EXTRACT_ALL in Pig

Submitted by 末鹿安然 on 2019-12-11 11:57:49

Question


This is my sample data,

subId=00001111911128052627,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212218.4621702216543667E17
subId=00001111911128052639,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212219.6726312167218586E17
subId=00001111911128052615,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212216.9431647633139046E17

My expected output will be a tuple where each field represents a matched group:

(capturing_group1, capturing_group2, ..., capturing_groupN)

e.g.(00001111911128052627,11232w34532543456345623453456984756894756,122112212212212216.9431647633139046E17)

This is my approach,

A = load '/home/hduser/Desktop/arrtest1.txt' using TextLoader as (line:chararray);
b = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[subId=](.*)[towerid=](.*)[bytes=](.*)')) AS (F1,F2,F3);

But I am not getting my result.


Answer 1:


Based on your input example you can try with this regex:

REGEX_EXTRACT_ALL(line,'subId=([^,]*),towerid=([^,]*),bytes=(.*)')
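Since Pig's regex functions are built on java.util.regex, the pattern can be tried in plain Java before running a Pig job. The following is only a sketch: the class ExtractAllSketch and the method extractAll are hypothetical names, and it assumes that REGEX_EXTRACT_ALL requires the whole line to match (as Matcher.matches() does) and yields nothing otherwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtractAllSketch {
    // Pattern from the answer: each value runs up to the next comma,
    // and the last value runs to the end of the line.
    static final Pattern FIELDS =
            Pattern.compile("subId=([^,]*),towerid=([^,]*),bytes=(.*)");

    // Returns the captured groups in order, or null when the whole
    // line does not match (assumed to mirror Pig's behaviour).
    static List<String> extractAll(String line) {
        Matcher m = FIELDS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> groups = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            groups.add(m.group(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        // First line of the sample data from the question.
        String line = "subId=00001111911128052627,"
                + "towerid=11232w34532543456345623453456984756894756,"
                + "bytes=122112212212212218.4621702216543667E17";
        System.out.println(extractAll(line));
    }
}
```

In the Pig script, the same three groups would be named in the AS clause after FLATTEN, one alias per capturing group.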

You can check the behaviour of this regex at this link.
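As for the pattern in the question: square brackets in a regex denote a character class, so [subId=] matches exactly one character out of s, u, b, I, d and =, not the literal text subId=. That pattern also contains four capturing groups, while the AS clause names only three fields. A minimal Java sketch of the character-class behaviour (CharClassSketch and fullMatch are hypothetical names):

```java
import java.util.regex.Pattern;

final class CharClassSketch {
    // True when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // A character class consumes a single character from its set.
        System.out.println(fullMatch("[subId=]", "s"));      // true
        // It therefore cannot match the six-character literal.
        System.out.println(fullMatch("[subId=]", "subId=")); // false
        // Without brackets the text is matched literally.
        System.out.println(fullMatch("subId=", "subId="));   // true
    }
}
```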

Update: why not use .* to check the field?

The Kleene operator * is greedy by default, so .* makes the regex engine match all the way to the end of the string. The engine then backtracks one character at a time, checking at each position whether the next part of the regex matches (e.g. it looks for a comma after the first .*).
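The effect of greediness is easiest to see on a partial pattern searched with find(). In this sketch (GreedySketch, firstGroup and the shortened input line are made up for illustration), the greedy group backs off only as far as the last comma, while the negated class stops at the first one without backtracking:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedySketch {
    // Returns group 1 of the first match found in the input,
    // or null when the regex does not match anywhere.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical shortened line with the same layout as the sample data.
        String line = "subId=1,towerid=2,bytes=3";
        // Greedy: .* runs to the end, then backtracks to the last comma.
        System.out.println(firstGroup("subId=(.*),", line));    // 1,towerid=2
        // Negated class: stops at the first comma.
        System.out.println(firstGroup("subId=([^,]*),", line)); // 1
    }
}
```

In the full three-field pattern, the later parts of the regex force the greedy groups to backtrack further until every field lines up, which is where the extra steps come from.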

So in the end all the regexes below match, but they need a different number of steps to do so:

[a-zA-Z]+=(.*),[a-zA-Z]+=(.*),[a-zA-Z]+=(.*) - 1142 steps

subId=([^,]*),towerid=([^,]*),bytes=(.*) - 96 steps.

If you don't care about the field names and only need them to consist of letters (uppercase or lowercase):

(?i)[a-z]+=([^,]*)[a-z,]+=([^,]*),[a-z,]+=(.*) - 58 steps

NB: the Apache Pig regex engine is based on the Java one, so the case-insensitive flag (?i) is likely to work too.
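The flag can at least be checked against java.util.regex directly. This sketch shows only the Java behaviour, not Pig's; the class CaseFlagSketch, the method matchesLine and the mixed-case input line are invented for the example:

```java
import java.util.regex.Pattern;

final class CaseFlagSketch {
    // Field-name-agnostic pattern from the answer, without the flag.
    static final String BASE = "[a-z]+=([^,]*)[a-z,]+=([^,]*),[a-z,]+=(.*)";

    // True when the whole line matches, with or without the (?i) flag.
    static boolean matchesLine(boolean caseInsensitive, String line) {
        String regex = caseInsensitive ? "(?i)" + BASE : BASE;
        return Pattern.matches(regex, line);
    }

    public static void main(String[] args) {
        // Hypothetical line with mixed-case field names.
        String line = "SUBID=1,TowerId=2,BYTES=3";
        System.out.println(matchesLine(true, line));  // true
        System.out.println(matchesLine(false, line)); // false
    }
}
```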



Source: https://stackoverflow.com/questions/34492105/how-to-use-regex-extract-all-in-pig
