Regex Optimization for large lists

逝去的感伤 2020-12-21 22:26

I am comparing two lists of strings to find possible matches. Example:

public class Tester {

    public static void main(String[] args) {

        List

        
4 Answers
  • 2020-12-21 23:03

    Instead of

    s.matches(".*" + s2 + ".*")
    

    you can use

    s.contains(s2)
    

    or

    s.indexOf(s2) > -1
    

    I tested both, each is about 35x faster than matches.
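
    One caveat when swapping matches for contains: the two only agree when s2 has no regex metacharacters and s has no line breaks. The sketch below illustrates the difference; the class and method names are made up for this example and are not from the question.

```java
import java.util.regex.Pattern;

public class ContainsVsMatches {

    // Regex version from the question: s2 is interpreted as a pattern.
    static boolean viaMatches(String s, String s2) {
        return s.matches(".*" + s2 + ".*");
    }

    // Literal substring test: s2 is taken character for character.
    static boolean viaContains(String s, String s2) {
        return s.contains(s2);
    }

    // Regex version that behaves like contains: s2 is quoted so it is
    // literal, and DOTALL lets '.' match line terminators as well.
    static boolean viaQuotedRegex(String s, String s2) {
        Pattern p = Pattern.compile(".*" + Pattern.quote(s2) + ".*", Pattern.DOTALL);
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(viaMatches("foobar", "oba"));      // true
        System.out.println(viaContains("foobar", "oba"));     // true
        System.out.println(viaMatches("fooXbar", "o.b"));     // true, '.' matches 'X'
        System.out.println(viaContains("fooXbar", "o.b"));    // false, no literal "o.b"
        System.out.println(viaQuotedRegex("fooXbar", "o.b")); // false, same as contains
    }
}
```

    If the entries in the second list really are meant as patterns, contains is not a drop-in replacement.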

  • 2020-12-21 23:05

    I don't think you should use regex for this. String#contains (see its Javadoc entry) should give you better performance ;)

    For example, your code could be:

    for(final String s2: test2){
        for (final String s: test){
            if(s.contains(s2)) {
                System.out.println("Match");
            }
        }
    }
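
    For reference, here is the same loop as a complete program. The question's own lists are cut off above, so the sample data and the names ContainsLoop and findMatches are placeholders chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContainsLoop {

    // Collects one line per (s, s2) pair where s contains s2 as a literal substring.
    static List<String> findMatches(List<String> test, List<String> test2) {
        List<String> hits = new ArrayList<>();
        for (final String s2 : test2) {
            for (final String s : test) {
                if (s.contains(s2)) {
                    hits.add(s + " contains " + s2);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Placeholder data, not the lists from the question.
        List<String> test = Arrays.asList("regex optimization", "large lists", "string search");
        List<String> test2 = Arrays.asList("list", "regex", "xyz");
        for (String hit : findMatches(test, test2)) {
            System.out.println(hit);
        }
    }
}
```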
    
  • 2020-12-21 23:10

    You absolutely should be creating a single Matcher object in this situation, and using that single object in every loop iteration. You are currently creating a new matcher (and compiling a new Pattern) in each loop iteration.

    Before the inner loop (once per s2, since the pattern depends on it), do this:

    //"": Unused to-search string, so the matcher object can be reused
    Matcher mtchr = Pattern.compile(".*" + s2 + ".*").matcher("");
    

    Then in your loop, do this:

    if(mtchr.reset(s).matches())  {
       ...
    

    But I'll agree with @maaartinus here, and say that, given your requirements, you don't need regex at all, and can instead use indexOf(s2), or even better, contains(s2), as you don't seem to need the resulting index.

    Regardless, this concept of reusing a matcher is invaluable.
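
    Put together, the matcher reuse could look roughly like the sketch below. It assumes the loop structure from the question (outer loop over the patterns, inner loop over the strings to search); ReusedMatcher and countMatches are names invented for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReusedMatcher {

    // Counts the (s, s2) pairs where s matches ".*" + s2 + ".*".
    // The pattern is compiled once per s2, and one Matcher is reset
    // for every candidate string instead of being recreated.
    static int countMatches(List<String> test, List<String> test2) {
        int count = 0;
        for (String s2 : test2) {
            // "": unused to-search string, so the matcher object can be reused
            Matcher mtchr = Pattern.compile(".*" + s2 + ".*").matcher("");
            for (String s : test) {
                if (mtchr.reset(s).matches()) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Placeholder data, not the lists from the question.
        List<String> test = Arrays.asList("apple pie", "banana split", "cherry tart");
        List<String> test2 = Arrays.asList("an", "pie", "x");
        System.out.println(countMatches(test, test2)); // 2
    }
}
```

    Note that a Matcher is not thread-safe, so this only works as long as each thread has its own instance.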

  • 2020-12-21 23:14

    IMHO methods like String.matches(String) should be forbidden. Maybe you need a regex match, maybe not, but what happens here is that your string gets compiled into a regex... again and again.

    So do yourself a favor and convert them all into regexes via Pattern.compile and reuse them.


    Looking at your ".*" + s2 + ".*", I'd bet you need no regex at all. Simply use String.contains and enjoy the speed.
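
    If the strings do turn out to be real regexes, precompiling could be sketched as follows. The names PrecompiledPatterns, compileAll and anyMatch are hypothetical. The sketch also uses Matcher.find() on the bare pattern instead of wrapping it in ".*", which is a substitution on my part: for input without line breaks it gives the same answer as the wrapped matches() call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PrecompiledPatterns {

    // Compiles every regex exactly once, up front.
    static List<Pattern> compileAll(List<String> regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        return patterns;
    }

    // True if any precompiled pattern occurs somewhere in s.
    static boolean anyMatch(List<Pattern> patterns, String s) {
        for (Pattern p : patterns) {
            if (p.matcher(s).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Placeholder patterns, not the ones from the question.
        List<Pattern> patterns = compileAll(Arrays.asList("a.c", "\\d+"));
        System.out.println(anyMatch(patterns, "xxabcxx"));   // true, "a.c" finds "abc"
        System.out.println(anyMatch(patterns, "id 42"));     // true, "\\d+" finds "42"
        System.out.println(anyMatch(patterns, "no digits")); // false
    }
}
```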
