java regular expressions: performance and alternative

后端未结

关注

 2  789

Recently I have been had to search a number of string values to see which one matches a certain pattern. Neither the number of string values nor the pattern itself is clear

相关标签:

2条回答

爱一瞬间的悲伤

2020-12-13 00:51
Regular expressions in Java are compiled into an internal data structure. This compilation is the time-consuming process. Each time you invoke the method String.matches(String regex), the specified regular expression is compiled again.

So you should compile your regular expression only once and reuse it:
```
Pattern pattern = Pattern.compile(regexPattern);
for(String value : values) {
    Matcher matcher = pattern.matcher(value);
    if (matcher.matches()) {
        // your code here
    }
}
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

夕颜

2020-12-13 01:08

Consider the following (quick and dirty) test:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test3 {

    // time that tick() was called
    static long tickTime;

    // called at start of operation, for timing
    static void tick () {
        tickTime = System.nanoTime();
    }

    // called at end of operation, prints message and time since tick().
    static void tock (String action) {
        long mstime = (System.nanoTime() - tickTime) / 1000000;
        System.out.println(action + ": " + mstime + "ms");
    }

    // generate random strings of form AAAABBBCCCCC; a random 
    // number of characters each randomly repeated.
    static List<String> generateData (int itemCount) {

        Random random = new Random();
        List<String> items = new ArrayList<String>();
        long mean = 0;

        for (int n = 0; n < itemCount; ++ n) {
            StringBuilder s = new StringBuilder();
            int characters = random.nextInt(7) + 1;
            for (int k = 0; k < characters; ++ k) {
                char c = (char)(random.nextInt('Z' - 'A') + 'A');
                int rep = random.nextInt(95) + 5;
                for (int j = 0; j < rep; ++ j)
                    s.append(c);
                mean += rep;
            }
            items.add(s.toString());
        }

        mean /= itemCount;
        System.out.println("generated data, average length: " + mean);

        return items;

    }

    // match all strings in items to regexStr, do not precompile.
    static void regexTestUncompiled (List<String> items, String regexStr) {

        tick();

        int matched = 0, unmatched = 0;

        for (String item:items) {
            if (item.matches(regexStr))
                ++ matched;
            else
                ++ unmatched;
        }

        tock("uncompiled: regex=" + regexStr + " matched=" + matched + 
             " unmatched=" + unmatched);

    }

    // match all strings in items to regexStr, precompile.
    static void regexTestCompiled (List<String> items, String regexStr) {

        tick();

        Matcher matcher = Pattern.compile(regexStr).matcher("");
        int matched = 0, unmatched = 0;

        for (String item:items) {
            if (matcher.reset(item).matches())
                ++ matched;
            else
                ++ unmatched;
        }

        tock("compiled: regex=" + regexStr + " matched=" + matched + 
             " unmatched=" + unmatched);

    }

    // test all strings in items against regexStr.
    static void regexTest (List<String> items, String regexStr) {

        regexTestUncompiled(items, regexStr);
        regexTestCompiled(items, regexStr);

    }

    // generate data and run some basic tests
    public static void main (String[] args) {

        List<String> items = generateData(1000000);
        regexTest(items, "A*");
        regexTest(items, "A*B*C*");
        regexTest(items, "E*C*W*F*");

    }

}

Strings are random sequences of 1-8 characters with each character occurring 5-100 consecutive times (e.g. "AAAAAAGGGGGDDFFFFFF"). I guessed based on your expressions.

Granted this might not be representative of your data set, but the timing estimates for applying those regular expressions to 1 million randomly generates strings of average length 208 each on my modest 2.3 GHz dual-core i5 was:

Regex      Uncompiled    Precompiled
A*          0.564 sec     0.126 sec
A*B*C*      1.768 sec     0.238 sec
E*C*W*F*    0.795 sec     0.275 sec

Actual output:

generated data, average length: 208
uncompiled: regex=A* matched=6004 unmatched=993996: 564ms
compiled: regex=A* matched=6004 unmatched=993996: 126ms
uncompiled: regex=A*B*C* matched=18677 unmatched=981323: 1768ms
compiled: regex=A*B*C* matched=18677 unmatched=981323: 238ms
uncompiled: regex=E*C*W*F* matched=25495 unmatched=974505: 795ms
compiled: regex=E*C*W*F* matched=25495 unmatched=974505: 275ms

Even without the speedup of precompiled expressions, and even considering that the results vary wildly depending on the data set and regular expression (and even considering that I broke a basic rule of proper Java performance tests and forgot to prime HotSpot first), this is very fast, and I still wonder if the bottleneck is truly where you think it is.

After switching to precompiled expressions, if you still are not meeting your actual performance requirements, do some profiling. If you find your bottleneck is still in your search, consider implementing a more optimized search algorithm.

For example, assuming your data set is like my test set above: If your data set is known ahead of time, reduce each item in it to a smaller string key by removing repetitive characters, e.g. for "AAAAAAABBBBCCCCCCC", store it in a map of some sort keyed by "ABC". When a user searches for "ABC*" (presuming your regex's are in that particular form), look for "ABC" items. Or whatever. It highly depends on your scenario.

0 讨论(0)