Java regular expressions: performance and alternative

时光说笑 2020-12-13 00:20

Recently I have had to search a number of string values to see which ones match a certain pattern. Neither the number of string values nor the pattern itself is clear

2 answers
  • 2020-12-13 00:51

    Regular expressions in Java are compiled into an internal data structure, and that compilation is the time-consuming step. Each call to String.matches(String regex) compiles the given regular expression again.

    So you should compile your regular expression only once and reuse it:

    Pattern pattern = Pattern.compile(regexPattern);
    for(String value : values) {
        Matcher matcher = pattern.matcher(value);
        if (matcher.matches()) {
            // your code here
        }
    }
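
One way to make the "compile once" rule hard to break is to keep the Pattern in a static final field. The sketch below is only an illustration: the class name PrecompiledFilter, the method name matching, and the regex "A*B*C*" are placeholders, not anything from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and the regex are placeholders.
class PrecompiledFilter {

    // Compiled once when the class is loaded. Pattern instances are
    // immutable and safe to share between threads.
    private static final Pattern ONLY_ABC = Pattern.compile("A*B*C*");

    // Returns the values that match the whole pattern, in input order.
    static List<String> matching(List<String> values) {
        List<String> result = new ArrayList<>();
        for (String value : values) {
            // Matcher is not thread-safe, so it stays local to this call.
            if (ONLY_ABC.matcher(value).matches()) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(matching(Arrays.asList("AABBC", "ABD", "", "CCC")));
    }
}
```

If the pattern is only known at runtime, the same idea applies: compile it once where it is first received and pass the Pattern object around instead of the regex string.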
    
  • 2020-12-13 01:08

    Consider the following (quick and dirty) test:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class Test3 {
    
        // time that tick() was called
        static long tickTime;
    
        // called at start of operation, for timing
        static void tick () {
            tickTime = System.nanoTime();
        }
    
        // called at end of operation, prints message and time since tick().
        static void tock (String action) {
            long mstime = (System.nanoTime() - tickTime) / 1000000;
            System.out.println(action + ": " + mstime + "ms");
        }
    
        // generate random strings of form AAAABBBCCCCC; a random 
        // number of characters each randomly repeated.
        static List<String> generateData (int itemCount) {
    
            Random random = new Random();
            List<String> items = new ArrayList<String>();
            long mean = 0;
    
            for (int n = 0; n < itemCount; ++ n) {
                StringBuilder s = new StringBuilder();
                int characters = random.nextInt(7) + 1;
                for (int k = 0; k < characters; ++ k) {
                    // nextInt's bound is exclusive, so this yields 'A'..'Y' (never 'Z')
                    char c = (char)(random.nextInt('Z' - 'A') + 'A');
                    int rep = random.nextInt(95) + 5;
                    for (int j = 0; j < rep; ++ j)
                        s.append(c);
                    mean += rep;
                }
                items.add(s.toString());
            }
    
            mean /= itemCount;
            System.out.println("generated data, average length: " + mean);
    
            return items;
    
        }
    
        // match all strings in items to regexStr, do not precompile.
        static void regexTestUncompiled (List<String> items, String regexStr) {
    
            tick();
    
            int matched = 0, unmatched = 0;
    
            for (String item:items) {
                if (item.matches(regexStr))
                    ++ matched;
                else
                    ++ unmatched;
            }
    
            tock("uncompiled: regex=" + regexStr + " matched=" + matched + 
                 " unmatched=" + unmatched);
    
        }
    
        // match all strings in items to regexStr, precompile.
        static void regexTestCompiled (List<String> items, String regexStr) {
    
            tick();
    
            Matcher matcher = Pattern.compile(regexStr).matcher("");
            int matched = 0, unmatched = 0;
    
            for (String item:items) {
                if (matcher.reset(item).matches())
                    ++ matched;
                else
                    ++ unmatched;
            }
    
            tock("compiled: regex=" + regexStr + " matched=" + matched + 
                 " unmatched=" + unmatched);
    
        }
    
        // test all strings in items against regexStr.
        static void regexTest (List<String> items, String regexStr) {
    
            regexTestUncompiled(items, regexStr);
            regexTestCompiled(items, regexStr);
    
        }
    
        // generate data and run some basic tests
        public static void main (String[] args) {
    
            List<String> items = generateData(1000000);
            regexTest(items, "A*");
            regexTest(items, "A*B*C*");
            regexTest(items, "E*C*W*F*");
    
        }
    
    }
    

    Strings are random sequences of 1-7 characters, each repeated 5-99 consecutive times (e.g. "AAAAAAGGGGGDDDDDFFFFFF"). I guessed at this form based on your expressions.

    Granted, this might not be representative of your data set, but the timings for applying those regular expressions to 1 million randomly generated strings of average length 208 on my modest 2.3 GHz dual-core i5 were:

    Regex      Uncompiled    Precompiled
    A*          0.564 sec     0.126 sec
    A*B*C*      1.768 sec     0.238 sec
    E*C*W*F*    0.795 sec     0.275 sec
    

    Actual output:

    generated data, average length: 208
    uncompiled: regex=A* matched=6004 unmatched=993996: 564ms
    compiled: regex=A* matched=6004 unmatched=993996: 126ms
    uncompiled: regex=A*B*C* matched=18677 unmatched=981323: 1768ms
    compiled: regex=A*B*C* matched=18677 unmatched=981323: 238ms
    uncompiled: regex=E*C*W*F* matched=25495 unmatched=974505: 795ms
    compiled: regex=E*C*W*F* matched=25495 unmatched=974505: 275ms
    

    Even without the speedup of precompiled expressions, and even allowing that results vary widely with the data set and the regular expression (and that I broke a basic rule of proper Java performance tests by not warming up HotSpot first), this is very fast, and I still wonder whether the bottleneck is truly where you think it is.
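
For the HotSpot caveat, a rough way to reduce JIT noise is to run the measured code a few times untimed before starting the clock. This is a minimal sketch, not a rigorous benchmark: the class WarmedUpTiming, its method names, and WARMUP_ROUNDS = 20 are all assumptions made for illustration, and a dedicated harness such as JMH would be the more careful choice.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical harness; WARMUP_ROUNDS is an arbitrary choice, not a tuned value.
class WarmedUpTiming {

    static final int WARMUP_ROUNDS = 20;

    // Counts the items that match the whole pattern, reusing one Matcher.
    static int countMatches(Pattern pattern, List<String> items) {
        Matcher matcher = pattern.matcher("");
        int matched = 0;
        for (String item : items) {
            if (matcher.reset(item).matches()) {
                ++matched;
            }
        }
        return matched;
    }

    // Runs untimed warm-up rounds first, then returns the elapsed
    // nanoseconds of a single timed round.
    static long timeAfterWarmup(Pattern pattern, List<String> items) {
        long sink = 0;
        for (int i = 0; i < WARMUP_ROUNDS; ++i) {
            sink += countMatches(pattern, items);
        }
        long start = System.nanoTime();
        sink += countMatches(pattern, items);
        long elapsed = System.nanoTime() - start;
        // Use the accumulated result so the loops are not dead code.
        if (sink < 0) {
            throw new IllegalStateException("match count cannot be negative");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("AAAABBBCC", "EEECCCWWF", "XYZ");
        long nanos = timeAfterWarmup(Pattern.compile("A*B*C*"), items);
        System.out.println("timed round: " + nanos + "ns");
    }
}
```

With the test class above, the same effect could be had by calling regexTest once on a smaller list before the measured calls.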

    After switching to precompiled expressions, if you still are not meeting your actual performance requirements, do some profiling. If you find your bottleneck is still in your search, consider implementing a more optimized search algorithm.

    For example, assuming your data set is like my test set above: if your data set is known ahead of time, reduce each item in it to a smaller string key by removing repeated characters, e.g. for "AAAAAAABBBBCCCCCCC", store it in a map of some sort keyed by "ABC". When a user searches for "ABC*" (presuming your regexes are in that particular form), look for "ABC" items. Or whatever. It highly depends on your scenario.
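
The key-reduction idea can be sketched as follows. Everything here is an assumption for illustration: the class RunKeyIndex and its methods are made up, and it presumes that a query can be reduced to the same kind of key (one letter per run, in order), which only holds for patterns of that particular shape.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index that groups items by their run-collapsed key.
class RunKeyIndex {

    private final Map<String, List<String>> itemsByKey = new HashMap<>();

    // Collapses each run of a repeated character to one character,
    // e.g. "AAAAAAABBBBCCCCCCC" becomes "ABC".
    static String collapseRuns(String s) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < s.length(); ++i) {
            char c = s.charAt(i);
            if (i == 0 || c != s.charAt(i - 1)) {
                key.append(c);
            }
        }
        return key.toString();
    }

    // Stores the item under its collapsed key.
    void add(String item) {
        itemsByKey.computeIfAbsent(collapseRuns(item), k -> new ArrayList<>()).add(item);
    }

    // Returns all stored items whose collapsed key equals the given key.
    List<String> lookup(String key) {
        return itemsByKey.getOrDefault(key, Collections.emptyList());
    }

    public static void main(String[] args) {
        RunKeyIndex index = new RunKeyIndex();
        index.add("AAAAAAABBBBCCCCCCC");
        index.add("ABBC");
        index.add("AAAGGG");
        System.out.println(index.lookup("ABC"));
    }
}
```

A lookup is then a single hash-map access instead of a regex match against every item, at the cost of building the index up front.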
