Too many filter matching in pig

Submitted by 孤者浪人 on 2019-12-08 07:20:40

Question


I have a list of about 1000 filter keywords, and I need to filter a field of a relation in Pig using this list.

Initially, I declared these keywords like this:

%declare p1 '.keyword1.';
...
%declare p1000 '.keyword1000.';

I then filter like this:

Filtered = FILTER SRC BY (not $0 matches '$p1') and (not $0 matches '$p2') and ...... (not $0 matches '$p1000');

DUMP Filtered;

Assume that my source relation is SRC and that I need to filter on the first field, i.e. $0.

If I reduce the number of filters to 100-200, it works fine, but when the number of filters increases to 1000, it doesn't work.

Can somebody suggest a workaround to get the results right?

Thanks in advance


Answer 1:


You can write a simple filter UDF that performs all the checks, something like:

 package myudfs;

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.List;

 import org.apache.pig.FilterFunc;
 import org.apache.pig.data.Tuple;

 public class MYFILTER extends FilterFunc
 {
     // Keywords to reject; shared by all instances of the UDF.
     static List<String> filterList = new ArrayList<String>();

     static {
         //load all filters
     }

     // Returns true (keep the row) when the first field is not in filterList.
     public Boolean exec(Tuple input) throws IOException {
         if (input == null || input.size() == 0)
             return null;
         try {
             String str = (String) input.get(0);
             return !filterList.contains(str);
         } catch (Exception e) {
             throw new IOException("Caught exception processing input row ", e);
         }
     }
 }
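
Note that `filterList.contains(str)` tests exact equality against the list entries, whereas the question uses regex `matches`. If the intent is to drop rows whose field contains any keyword as a substring (an assumption, not something the question states), the check inside `exec` could use one precompiled pattern instead of a list lookup. Here is a minimal sketch in plain Java with no Pig dependency; the class name `KeywordMatcher` and the sample keywords are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Pig: builds one regex from many keywords.
final class KeywordMatcher {
    // null when there are no keywords, so that nothing is rejected.
    private final Pattern anyKeyword;

    KeywordMatcher(List<String> keywords) {
        if (keywords.isEmpty()) {
            this.anyKeyword = null;
        } else {
            // Quote each keyword so regex metacharacters are taken literally,
            // then join them into a single alternation.
            String alternation = keywords.stream()
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|"));
            this.anyKeyword = Pattern.compile(alternation);
        }
    }

    // Returns true when the field contains none of the keywords,
    // i.e. the row should be kept by the filter.
    boolean keep(String field) {
        return anyKeyword == null || !anyKeyword.matcher(field).find();
    }

    public static void main(String[] args) {
        KeywordMatcher m = new KeywordMatcher(Arrays.asList("spam", "a.b"));
        System.out.println(m.keep("clean row"));      // true
        System.out.println(m.keep("some spam here")); // false
        System.out.println(m.keep("axb"));            // true: "." is literal
    }
}
```

Compiling the pattern once, for example in the static initializer, avoids rebuilding it for every row.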



Answer 2:


One shallow approach is to split the filtering into stages: filter keywords 1 to 100 in stage one, then the next 100, and so on, for a total of (count(keywords)/100) stages. However, given more details about your data, there is probably a better solution.

As for the shallow solution above, you can wrap the Pig script in a shell script that parcels out the input and starts a run for the keyword subset currently being filtered.
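
The batching step could look something like the sketch below. It only splits the keyword list into groups of 100 and builds the filter condition for each group in the form the question uses; how each stage is actually run (the wrapper script) is left out. The class name `FilterStages`, its methods, and the generated keyword patterns are hypothetical, and patterns containing a single quote would need escaping that this sketch does not do:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a keyword list into fixed-size batches,
// one batch per filtering stage.
final class FilterStages {
    // Partitions the keywords into consecutive batches of at most `size`.
    static List<List<String>> batches(List<String> keywords, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < keywords.size(); i += size) {
            int end = Math.min(i + size, keywords.size());
            out.add(new ArrayList<>(keywords.subList(i, end)));
        }
        return out;
    }

    // Builds the condition for one stage, in the same form as the question:
    // (not $0 matches 'p1') and (not $0 matches 'p2') and ...
    static String condition(List<String> patterns) {
        StringBuilder sb = new StringBuilder();
        for (String p : patterns) {
            if (sb.length() > 0) {
                sb.append(" and ");
            }
            sb.append("(not $0 matches '").append(p).append("')");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> keywords = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            keywords.add(".keyword" + i + ".");
        }
        List<List<String>> stages = batches(keywords, 100);
        System.out.println(stages.size()); // 10
        System.out.println(condition(stages.get(0).subList(0, 2)));
    }
}
```

When the keyword count is not a multiple of the batch size, the last batch is simply shorter, so the number of stages is the count divided by 100, rounded up.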



Source: https://stackoverflow.com/questions/10349618/too-many-filter-matching-in-pig
