Using Regex in Pig in Hadoop

Question

I have a CSV file of tweets with the columns (tweetid, tweets, userid):

396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“@BleacherReport: Halloween has given us this amazing Derrick Rose photo (via @amandakaschube, @ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03

Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.

For this I have the following code:

A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;

How does this regular expression split each line into the desired fields?

I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL, as I find it much simpler, but I couldn't get the code working except for extracting just the tweets:

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);

The alias above gets me the tweets, but if I use REGEX_EXTRACT to get the tweet id, I do not get the desired output:

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);

(396124554353197056,"Just saw @samantha0wen and @DakotaFears at the drake concert #waddup")
(396124554172432384,"@Yutika_Diwadkar I'm just so bright 😁")

(396124554609033216,"@TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")

(396124554805776385,"@MichaelThe_Lion me too 😒")

(396124552540852226,"Happy Halloween from us 2 @maddow & @Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>

Please help.

Answer 1:

I can't comment, but after looking at this and testing it out, it looks like the quotes in your regex are different from those in the CSV:

" (straight quote) in the CSV

” (curly quote) in the regex code
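To see what the question's three-group pattern actually does, here is a small Python sketch. It is only an approximation of the Pig behaviour: it assumes REGEX_EXTRACT_ALL requires the pattern to match the whole line (mimicked here with `re.fullmatch`), and it relies on this particular pattern behaving the same in Python's `re` as in the Java regex engine Pig uses. The second sample line is a shortened version of the first tweet in the question, and `split_line` is a hypothetical helper name.

```python
import re

# The pattern from the question. The curly quotes are written as escapes so
# they are easy to tell apart: \u201d is the right curly quote, \u201c the left.
PATTERN = re.compile('(.*)[,\u201d:-](.*)[\u201c,:-](.*)')


def split_line(line):
    """Return the three captured groups, or None if the whole line does not match."""
    m = PATTERN.fullmatch(line)
    return m.groups() if m else None


simple = '396124436845178880,"When\'s 12.4k gonna roll around",Matty_T_03'
# Shortened version of the first sample tweet; note the comma inside the text.
with_comma = '396124436476092416,"Life is truly a gift, but at the same it is a curse",Obey_Jony09'

# The straight and curly quotes are two different characters.
print(hex(ord('"')), hex(ord('\u201d')))  # 0x22 0x201d

print(split_line(simple))
# ('396124436845178880', '"When\'s 12.4k gonna roll around"', 'Matty_T_03')

print(split_line(with_comma))
# ('396124436476092416,"Life is truly a gift', ' but at the same it is a curse"', 'Obey_Jony09')
```

Under those assumptions, the curly quotes in the character classes match nothing in these two lines, so the greedy groups end up splitting on the last two commas, colons or hyphens. The message keeps its surrounding straight quotes, and a tweet with a comma of its own pushes part of the text into the first field, which then cannot be cast to long.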

To get the tweetid try this:

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'^([0-9]+),"',1)) AS (tweetid:long);
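A short Python sketch can show why the question's tweet id attempt returned the id together with the tweet, and how a stricter pattern behaves. It assumes REGEX_EXTRACT looks for the pattern anywhere in the line (mimicked here with `re.search`). The anchored digits-only pattern is an illustrative choice, and Python's `re` again stands in for the Java regex engine Pig uses.

```python
import re

line = '396124436845178880,"When\'s 12.4k gonna roll around",Matty_T_03'

# The question's pattern (curly quote written as \u201d): the greedy (.*)
# runs to the LAST comma, colon or hyphen in the line, not the first.
greedy = re.search('(.*)[,\u201d:-]', line).group(1)

# Anchored at the start of the line and limited to digits, so the capture
# stops at the first comma followed by a straight quote.
anchored = re.search('^([0-9]+),"', line).group(1)

print(greedy)         # 396124436845178880,"When's 12.4k gonna roll around"
print(anchored)       # 396124436845178880
print(int(anchored))  # converts cleanly to an integer
```

The greedy result has the same shape as the output shown in the question: the id, a comma, and the quoted tweet in a single string.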

Source: https://stackoverflow.com/questions/32089571/using-regex-in-pig-in-hadoop

Tags

regex

csv

Hadoop

apache-pig