Regex Optimization for large lists

一笑奈何 提交于 2019-12-29 09:21:35

问题


I am comparing two lists of strings to find possible matches. Example:

public class Tester {

    public static void main(String[] args) {

        List<String> test = new ArrayList<String>();
        List<String> test2 = new ArrayList<String>();

        test.add("3H0875AAAA0012");
        test.add("3H0875AABB0018");
        test.add("3H0875AAAC0010");
        test2.add("3H0875AA");


        for(String s2: test2){
            for (String s: test){
                if (s.matches(".*" + s2 + ".*")){
                    System.out.println("Match");
                }
            }
        }
    }
}

Basically for every string in test2 I want to see if there are any strings in test that contain test2 completely or partially. The output for the above code should be:

Match 
Match 
Match

However, in my real case scenario I have around 225K strings in test and around 5K strings in test2. It is taking too long process this comparison and wanted to see if it was possible to optimize the comparison. It takes about 10 minutes to analyze the first 1.5K items in test2. So it will take at least 30 to 40 minutes to finish the comparison.

Thanks in advance


回答1:


I think that you shouldn't use regex for that: I believe that looking into String#contains (here is a link to its javadoc entry) would give you better results, in terms of performance ;)

For example, your code could be:

for(final String s2: test2){
    for (final String s: test){
        if(s.contains(s2)) {
            System.out.println("Match");
        }
    }
}



回答2:


IMHO methods like String.matches(String) should be forbidden. Maybe you need a regex match, maybe not, but what happens here, is that you string gets compiled into an regex... again and again.

So do yourself a favor and convert then all into regexes via Pattern.compile and reuse them.


Looking at your ".*" + s2 + ".*", I'd bet you need no regex at all. Simply use String.contains and enjoy the speed.




回答3:


Instead of

s.matches(".*" + s2 + ".*")

you can use

s.contains(s2)

or

s.indexOf(s2) > -1

I tested both, each is about 35x faster than matches.




回答4:


You absolutely should be creating a single Matcher object in this situation, and using that single object in every loop iteration. You are currently creating a new matcher (and compiling a new Pattern) in each loop iteration.

At the top of your code, do this:

//"": Unused to-search string, so the matcher object can be reused
Matcher mtchr = Pattern.compile(".*" + s2 + ".*").matcher("");

Then in your loop, do this:

if(mtchr.reset(s).matches())  {
   ...

But I'll agree with @maaartinus here, and say that, given your requirements, you don't need regex at all, and can instead use indexOf(s), or even better, contains(s), as you don't seem to need the resulting index.

Regardless, this concept of reusing a matcher is invaluable.



来源:https://stackoverflow.com/questions/22951558/regex-optimization-for-large-lists

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!