Java or Pig regex to strip out values from UserAgent string

Submitted by 被刻印的时光 ゝ on 2019-12-11 02:39:47

Question


I need to strip out the third and subsequent values in the 'bracketed' component of the user agent string.

In order to get

Mozilla/4.0 (compatible; MSIE 8.0)

from

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)

This sed command does the job:

 sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/'

I need to get the same result in Apache Pig with a Java regex. Could anybody help me to re-write the above sed regular expression into Java?

Something like:

new = FOREACH userAgent GENERATE FLATTEN(EXTRACT(userAgent, 'JAVA REGEX?')) AS (term:chararray);

Answer 1:


I don't use Pig, but a look through the docs reveals a REPLACE function which wraps Java's replaceAll() method. Try this:

REPLACE(userAgent, '\(([^;]+; [^;]+)[^)]*\)', '($1)')

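That should match the whole parenthesized portion of the user agent string and replace its contents with just the first two semicolon-separated terms, just like your sed command does. If Pig rejects the pattern as written, the backslashes probably need to be doubled inside the Pig string literal ('\\(([^;]+; [^;]+)[^)]*\\)'), as is usual for regex arguments in Pig.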
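Since REPLACE is described above as a wrapper around replaceAll(), the pattern can be checked in plain Java before trying it in Pig. The following is only a sketch: the class name UserAgentTrimmer and the method trim are made up for illustration, and it says nothing about how Pig itself parses the string literal.

```java
public class UserAgentTrimmer {

    // Same pattern as in the REPLACE call; backslashes are doubled
    // because this is a Java string literal.
    private static final String PATTERN = "\\(([^;]+; [^;]+)[^)]*\\)";

    // Hypothetical helper: keeps only the first two semicolon-separated
    // terms inside the parentheses and drops the rest.
    public static String trim(String userAgent) {
        return userAgent.replaceAll(PATTERN, "($1)");
    }

    public static void main(String[] args) {
        String ua = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6)";
        System.out.println(trim(ua));
    }
}
```

A string with no parenthesized part, or with only two terms inside the parentheses, comes back unchanged, because replaceAll() leaves non-matching input alone and a two-term match is replaced by itself.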




Answer 2:


In Java, the Matcher class lets you extract a capturing group. The following appears to do what you want, at least for the test case you provided.

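import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) {
        String str = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)";
        // Group 1: everything up to and including the second term inside the
        // parentheses, without the semicolon that follows it.
        Pattern pat = Pattern.compile("(.*?\\([^;]*;[^;)]*)[^)]*\\)");
        Matcher m = pat.matcher(str);
        // lookingAt() anchors the match at the start of the input; only read
        // the group if it succeeded, otherwise group(1) would throw.
        if (m.lookingAt()) {
            // Re-append the closing parenthesis that the group leaves out.
            System.out.println(m.group(1) + ")");
        } else {
            System.out.println("no match");
        }
    }
}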

Hmm... I seem to have answered the wrong question, since you asked how to do this in Pig, not in plain Java.




Answer 3:


As neither of the two suggested solutions seems to work in Pig, I will post a workaround that runs sed through STREAM:

user_agent_mangled = STREAM logs THROUGH `sed 's/(\\([^;]\\+; [^;]\\+\\)[^)]*)/(\\1)/'`;

This works well, but I would still prefer a native Pig solution (using the EXTRACT or REPLACE function).



Source: https://stackoverflow.com/questions/8236482/java-or-pig-regex-to-strip-out-values-from-useragent-string
