Is regex too slow? Real life examples where simple non-regex alternative is better

粉色の甜心 2020-12-05 13:18

I've seen people here make comments like "regex is too slow!" or "why would you do something so simple using regex!" (and then present a 10+ line alternative instead).

5 answers
  •  隐瞒了意图╮
    2020-12-05 14:19

    I experimented a bit with the performance of various constructs, and unfortunately I discovered that Java regex doesn't perform what I consider very doable optimizations.

    Java regex takes O(N) to match "(?s)^.*+$"

    This is very disappointing. It's understandable for ".*" to take O(N), but even with the optimization "hints" in the form of anchors (^ and $), single-line mode (Pattern.DOTALL / (?s)), and a possessive repetition (i.e. no backtracking), the regex engine still cannot see that this pattern matches every string, and still has to scan all N characters.

    This pattern isn't very useful, of course, but consider the next problem.

    Java regex takes O(N) to match "(?s)^A.*Z$"

    Again, I was hoping that the regex engine could see that, thanks to the anchors and single-line mode, this is essentially the same as the O(1) non-regex:

     s.startsWith("A") && s.endsWith("Z")
    

    Unfortunately, no, this is still O(N). Very disappointing. Still, not very convincing because a nice and simple non-regex alternative exists.

    Java regex takes O(N) to match "(?s)^.*[aeiou]{3}$"

    This pattern matches strings that end with 3 lowercase vowels. There is no nice and simple non-regex one-liner, but you can still write something non-regex that matches this in O(1), since you only need to check the last 3 characters (for simplicity, we can assume that the string length is at least 3).
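
    A minimal sketch of such a non-regex check, for illustration only. The class and method names (VowelSuffix, isLowerVowel, endsWithThreeVowels) are made up here and are not from the original post; unlike the simplification above, it returns false for strings shorter than 3 characters instead of assuming they never occur.

```java
final class VowelSuffix {
    // True if c is one of the lowercase vowels a, e, i, o, u.
    static boolean isLowerVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    // Looks at the last three characters only, so the cost does not
    // depend on the length of s.
    static boolean endsWithThreeVowels(String s) {
        int n = s.length();
        if (n < 3) {
            return false;
        }
        return isLowerVowel(s.charAt(n - 1))
            && isLowerVowel(s.charAt(n - 2))
            && isLowerVowel(s.charAt(n - 3));
    }

    public static void main(String[] args) {
        System.out.println(endsWithThreeVowels("xyzooo")); // prints true
        System.out.println(endsWithThreeVowels("xyzoo!")); // prints false
    }
}
```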

    I also tried "(?s)^.*$(?<=[aeiou]{3})", in an attempt to tell the regex engine to just ignore everything else, and just check the last 3 characters, but of course this is still O(N) (which follows from the first section above).

    In this particular scenario, however, regex can be made useful by combining it with substring. That is, instead of checking whether the whole string matches the pattern, you can manually restrict the match to the last 3 characters. In general, if you know beforehand that the pattern has a bounded maximum match length, you can take a substring of that many characters from the end of a very long string and run the regex on just that part.
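
    The substring idea can be sketched roughly as follows. The names SuffixRegex, MAX_MATCH_LENGTH and THREE_VOWELS are hypothetical, chosen for this example; the pattern is precompiled and no longer needs the leading ".*" because it only ever sees the tail of the string.

```java
import java.util.regex.Pattern;

final class SuffixRegex {
    // Longest suffix the pattern can match; known in advance for this pattern.
    private static final int MAX_MATCH_LENGTH = 3;

    // Compiled once and reused. No ".*" prefix is needed, since the
    // pattern is only applied to the last MAX_MATCH_LENGTH characters.
    private static final Pattern THREE_VOWELS = Pattern.compile("[aeiou]{3}");

    static boolean endsWithThreeVowels(String s) {
        if (s.length() < MAX_MATCH_LENGTH) {
            return false; // too short to contain the suffix
        }
        // Cut off the tail first, then let the regex see only that part.
        String tail = s.substring(s.length() - MAX_MATCH_LENGTH);
        return THREE_VOWELS.matcher(tail).matches();
    }

    public static void main(String[] args) {
        System.out.println(endsWithThreeVowels("a very long string ... eau")); // prints true
        System.out.println(endsWithThreeVowels("a very long string ... end")); // prints false
    }
}
```

    The regex work here is bounded by MAX_MATCH_LENGTH rather than by the length of the input.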


    Test harness

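    // Helper not shown in the original post; assumed here to return
    // a filler string of the given length.
    static String stringLength(int n) {
        return new String(new char[n]).replace('\0', 'x');
    }

    static void testAnchors() {
        String pattern = "(?s)^.*[aeiou]{3}$";
        for (int N = 1; N < 20; N++) {
            // Input length doubles on every iteration; the "ooo" suffix always matches.
            String needle = stringLength(1 << N) + "ooo";
            System.out.println(N);
            boolean b = true;
            // Repeat the match 10000 times so the cost per N becomes noticeable.
            for (int reps = 10000; reps-- > 0; ) {
                b &=
                  needle
                  //.substring(needle.length() - 3) // try with this
                  .matches(pattern);
            }
            System.out.println(b);
        }
    }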
    

    The string length in this test grows exponentially. If you run this test, you will find that it starts to really slow down after 10 (i.e. string length 1024). If you uncomment the substring line, however, the entire test completes in no time. This also confirms that the slowdown is not caused by my not using Pattern.compile, which would yield a constant-factor improvement at best, but by the pattern taking O(N) to match, which is problematic when N grows exponentially.


    Conclusion

    It seems that Java regex does little to no optimization based on the pattern. Suffix matching in particular is especially costly, because the regex still needs to go through the entire length of the string.

    Thankfully, doing the regex on the chopped suffix using substring (if you know the maximum length of the match) can still allow you to use regex for suffix matching in time independent of the length of the input string.

    Update: actually, I just realized that this applies to prefix matching too. Java regex matches an O(1)-length prefix pattern in O(N). That is, "(?s)^[aeiou]{3}.*$" checks whether a string starts with 3 lowercase vowels in O(N), when it should be optimizable to O(1).

    I thought prefix matching would be more regex-friendly, but I don't think it's possible to come up with an O(1)-runtime pattern to match the above (unless someone can prove me wrong).

    Obviously you can do the s.substring(0, 3).matches("(?s)^[aeiou]{3}.*$") "trick", but the pattern itself is still O(N); you've just manually reduced N to a constant by using substring.

    So for any kind of finite-length prefix/suffix matching of a really long string, you should preprocess using substring before using regex; otherwise it's O(N) where O(1) suffices.
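
    For the prefix case, here is a sketch of the substring preprocessing, next to one alternative the post does not mention: Matcher.lookingAt(), which anchors the match at the start of the input but does not require it to reach the end, so the trailing ".*" can be dropped. As far as I can tell this avoids scanning the rest of the string, but I have not benchmarked it. The class and method names (PrefixRegex, startsWithThreeVowels, startsWithThreeVowelsSubstring) are hypothetical.

```java
import java.util.regex.Pattern;

final class PrefixRegex {
    private static final Pattern THREE_VOWELS = Pattern.compile("[aeiou]{3}");

    // Substring preprocessing as described above, with a length guard
    // added so that short inputs do not throw from substring(0, 3).
    static boolean startsWithThreeVowelsSubstring(String s) {
        return s.length() >= 3
            && s.substring(0, 3).matches("(?s)^[aeiou]{3}.*$");
    }

    // Alternative: lookingAt() matches at the beginning of the input
    // without requiring the whole input to match.
    static boolean startsWithThreeVowels(String s) {
        return THREE_VOWELS.matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(startsWithThreeVowelsSubstring("eau de toilette")); // prints true
        System.out.println(startsWithThreeVowels("eau de toilette"));          // prints true
        System.out.println(startsWithThreeVowels("water"));                    // prints false
    }
}
```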
