regular expression goes into infinite loop

無奈伤痛 2020-12-18 07:52

I am parsing (species) names of the form:

Parus Ater
H. sapiens
T. rex
Tyr. rex

which normally have two terms (binomial) but sometimes have

2 Answers
  • 2020-12-18 08:38

    To address the first part of your question, you should read up on catastrophic backtracking. Essentially, there are too many ways to match your regular expression against your string, and the regex engine keeps backtracking through them, trying to make the match work.

    In your case, the culprit was probably the nested repetition (\s*[a-z]+)*, which lets the engine split the same run of letters in a great many ways. As Qtax has adeptly pointed out, it's hard to tell without more information.

    The second part of your question is, unfortunately, much harder to answer. It feels a lot like the Halting problem: although regular expressions are essentially a finite state machine whose input is a string, I don't know of a practical general solution that predicts which regular expressions will backtrack catastrophically and which will not.

    As for tips on making your regular expressions run faster, that's a big can of worms. I've spent a lot of time studying regular expressions on my own, and some time optimizing them, and here's what I've found generally helps:

    1. Compile your regular expressions outside of your loops, if your language supports it.
    2. Whenever possible, add anchors when you know they're useful. Especially the ^ for the beginning of the string. See also: Word Boundaries
    3. Avoid nested repetition like the plague. If you have to have it (which you will), do your best to provide hints to the engine to short circuit any unintended backtracking.
    4. Take advantage of flavor constructs to speed things up. I'm partial to non-capturing groups and possessive quantifiers. They don't appear in every flavor, but when they do, you should use them. Also check out atomic groups.
    5. I always find this to be true: the longer your regular expression gets, the more trouble you're going to have making it efficient. Regular expressions are a great and powerful tool; they're like a super smart hammer. Don't fall into the trap of seeing everything as a nail. Sometimes the string function you're looking for is right under your nose.
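
A few of the tips above can be sketched in Java roughly as follows. This is only an illustration, not the asker's code: the class name SpeciesNameCheck, the method isValid, and the possessive rewrite of the pattern are all assumptions made for the example.

```java
import java.util.regex.Pattern;

final class SpeciesNameCheck {
    // Tip 1: compiled once and reused for every call.
    // Tips 3 and 4: a non-capturing group (?:...) plus possessive quantifiers
    // (*+, ++, {2,}+), so the engine never hands back characters it has consumed.
    // Hypothetical variant of the regex under discussion, not the asker's own pattern.
    static final Pattern NAME = Pattern.compile(
            "[A-Z][a-z]*+\\.?\\s++[a-z]{2,}+(?:\\s*+[a-z]++)*+");

    // matches() already requires the whole input to match, so no explicit
    // ^ or $ anchors are needed here.
    static boolean isValid(String candidate) {
        return NAME.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "H. sapiens", "T. rex", "Tyr. rex",
            "H. sapiensinputstringinputstringinputstring;" // invalid trailing ';'
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

With possessive quantifiers the invalid sample is rejected after a single pass over the letters, because there is nothing left for the engine to retry.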

    Hope this helps you. Good luck.

  • 2020-12-18 08:55

    For the first regex:

    [A-Z][a-z]*\.?\s+[a-z][a-z]+(\s*[a-z]+)*
    

    The catastrophic backtracking happens due to (\s*[a-z]+)*, as pointed out in the comment. However, it only occurs if you are validating the whole string with String.matches(), since this is the only case where encountering an invalid character causes the engine to try to backtrack, rather than returning the match found so far (as it would in a Matcher.find() loop).
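
The difference between the two modes can be seen on a short input. This is a minimal sketch; the class and method names (MatchesVersusFind, firstFound, wholeMatch) are made up for the illustration, and the input is kept short on purpose so that backtracking stays cheap.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchesVersusFind {
    // The first regex from the question, unchanged.
    static final Pattern FIRST = Pattern.compile(
            "[A-Z][a-z]*\\.?\\s+[a-z][a-z]+(\\s*[a-z]+)*");

    // find(): the engine stops at the invalid character and reports
    // the match it has so far, so it never needs to backtrack.
    static String firstFound(String input) {
        Matcher m = FIRST.matcher(input);
        return m.find() ? m.group() : null;
    }

    // matches(): the whole input must match, so the invalid character
    // forces the engine to try every alternative before giving up.
    static boolean wholeMatch(String input) {
        return FIRST.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "T. rex;";
        System.out.println("find():    " + firstFound(input));
        System.out.println("matches(): " + wholeMatch(input));
    }
}
```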

    Let us match an invalid string against (\s*[a-z]+)*:

    inputstringinputstring;
    
    (Repetition 1)
    \s*=(empty)
    [a-z]+=inputstringinputstring
    FAILED
    
    Backtrack [a-z]+=inputstringinputstrin
    (Repetition 2)
    \s*=(empty)
    [a-z]+=g
    FAILED
    
    (End repetition 2 since all choices are exhausted)
    Backtrack [a-z]+=inputstringinputstri
    (Repetition 2)
    \s*=(empty)
    [a-z]+=ng
    FAILED
    
    Backtrack [a-z]+=n
    (Repetition 3)
    \s*=(empty)
    [a-z]+=g
    FAILED
    
    (End repetition 3 since all choices are exhausted)
    (End repetition 2 since all choices are exhausted)
    Backtrack [a-z]+=inputstringinputstr
    

    By now, you should have noticed the problem. Let us define T(n) as the amount of work needed to establish that a string of length n does not match the pattern. From the way backtracking proceeds, we know T(n) = Sum [i = 0..(n-1)] T(i). Since T(n + 1) = T(n) + Sum [i = 0..(n-1)] T(i) = T(n) + T(n), we can derive T(n + 1) = 2T(n), which means that the backtracking process is exponential in time complexity!
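
To get a feel for the numbers, the recurrence can be evaluated directly. The sketch below assumes a base cost of T(0) = 1, which the answer does not state; BacktrackCost and cost are illustrative names, and the result is a unit-less count from the model, not a measurement of the real engine.

```java
final class BacktrackCost {
    // Evaluates T(n) = T(0) + T(1) + ... + T(n-1), with the assumed base T(0) = 1.
    static long cost(int n) {
        long[] t = new long[n + 1];
        t[0] = 1;          // assumption: rejecting the empty string costs one unit
        long running = 1;  // holds T(0) + ... + T(i-1) at the top of each iteration
        for (int i = 1; i <= n; i++) {
            t[i] = running;
            running += t[i];
        }
        return t[n];
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 22; n++) {
            System.out.println("T(" + n + ") = " + cost(n));
        }
    }
}
```

Under this assumption T(n) = 2^(n-1) for n >= 1, so the 22-letter input from the trace above already costs about two million units, and every additional letter doubles it.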

    Changing \s* to \s+ inside the group avoids the problem completely, since an instance of repetition can then only start at the boundary between a letter and a whitespace character. In contrast, the first regex allows an instance of repetition to start between any two letters.
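
A small sketch of that change, assuming the rest of the first regex stays as it is (WhitespaceRequired and isValid are names invented for this example):

```java
import java.util.regex.Pattern;

final class WhitespaceRequired {
    // Same as the first regex, except that the group now starts with \s+
    // instead of \s*, so a new repetition needs at least one whitespace character.
    static final Pattern FIXED = Pattern.compile(
            "[A-Z][a-z]*\\.?\\s+[a-z][a-z]+(\\s+[a-z]+)*");

    static boolean isValid(String candidate) {
        return FIXED.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "H. sapiens",
            "H. sapiens sapiens",
            "T. inputstringinputstringinputstringinputstring;" // invalid trailing ';'
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

On the invalid sample the engine still backs off one letter at a time, but each retry fails immediately because the next character is not whitespace, so the work grows roughly linearly with the length of the run.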

    To demonstrate this, (\s+[a-z]+\s*)* will give you backtracking hell when the invalid input string contains many words separated by multiple consecutive spaces, since it allows multiple places for a repetition instance to start. This follows the formula b^d, where b is the maximum number of consecutive spaces (minus 1) and d is the number of sequences of spaces. It is less severe than your first regex (it requires at least one letter and one space character per repetition, as opposed to a single letter per repetition in your first regex), but it is still a problem.
