Regular expression goes into infinite loop


無奈伤痛 2020-12-18 07:52

I am parsing (species) names of the form:

Parus Ater
H. sapiens
T. rex
Tyr. rex

which normally have two terms (binomial) but sometimes have

2 answers

南方客 (original poster)

2020-12-18 08:55
For the first regex:
```
[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s*[a-z]+)*
```
The catastrophic backtracking comes from (\s*[a-z]+)*, as pointed out in the comment. However, it only occurs when you validate the whole string with String.matches(), since that is the only case where hitting an invalid character forces the engine to backtrack instead of returning the match found so far (as it would in a Matcher loop).
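
To make the difference concrete, here is a minimal sketch that runs the first regex both ways on a deliberately short input. The class and method names (MatchesVsFind, validates, firstMatch) are made up for this illustration, and "a Matcher loop" is assumed to mean scanning with Matcher.find().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    // The first regex from the question, written as a Java string literal.
    static final Pattern NAME =
        Pattern.compile("[A-Z][a-z]*\\.?\\s+[a-z][a-z]+(\\s*[a-z]+)*");

    // Whole-string validation: same semantics as String.matches().
    static boolean validates(String s) {
        return NAME.matcher(s).matches();
    }

    // Scanning: return the first match found, or null if there is none.
    static String firstMatch(String s) {
        Matcher m = NAME.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Kept short on purpose; the trailing ';' is the invalid character.
        String input = "Parus ater;";
        System.out.println(firstMatch(input)); // Parus ater
        System.out.println(validates(input));  // false
    }
}
```

With find(), the engine stops as soon as it has matched "Parus ater" and never needs to explain the ';'. With matches(), it has to try every way of splitting "ater" before it can give up, which is harmless here but grows quickly as the last word gets longer.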

Let us match an invalid string against (\s*[a-z]+)*:
```
inputstringinputstring;

(Repetition 1)
\s*=(empty)
[a-z]+=inputstringinputstring
FAILED

Backtrack [a-z]+=inputstringinputstrin
(Repetition 2)
\s*=(empty)
[a-z]+=g
FAILED

(End repetition 2 since all choices are exhausted)
Backtrack [a-z]+=inputstringinputstri
(Repetition 2)
\s*=(empty)
[a-z]+=ng
FAILED

Backtrack [a-z]+=n
(Repetition 3)
\s*=(empty)
[a-z]+=g
FAILED

(End repetition 3 since all choices are exhausted)
(End repetition 2 since all choices are exhausted)
Backtrack [a-z]+=inputstringinputstr
```
By now, you should have noticed the problem. Let T(n) be the amount of work needed to establish that a string of length n does not match the pattern. From the way backtracking proceeds, we know T(n) = Sum [i = 0..(n-1)] T(i). It follows that T(n + 1) = T(n) + Sum [i = 0..(n-1)] T(i) = 2T(n), which means that the backtracking process is exponential in time complexity!
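
The doubling can be checked with a toy model. The sketch below is not Java's regex engine; it is a hand-written backtracking matcher for ([a-z]+)* followed by end of input (the \s* part is left out because it always matches empty on an all-letter input), and it simply counts how many times it is entered. The names BacktrackCount, countSteps and letters are invented for this example.

```java
class BacktrackCount {
    static long steps;

    // Toy matcher for ([a-z]+)* followed by end of input, starting at pos.
    // Greedy: try the longest run of letters first, then give back one at a time.
    static boolean matchFrom(String s, int pos) {
        steps++;
        if (pos == s.length()) return true;
        int end = pos;
        while (end < s.length() && s.charAt(end) >= 'a' && s.charAt(end) <= 'z') {
            end++;
        }
        for (int stop = end; stop > pos; stop--) {
            if (matchFrom(s, stop)) return true;
        }
        return false;
    }

    // Number of matchFrom calls needed to accept or reject s.
    static long countSteps(String s) {
        steps = 0;
        matchFrom(s, 0);
        return steps;
    }

    // n letters followed by an invalid ';'.
    static String letters(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append('a');
        return sb.append(';').toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 16; n++) {
            System.out.println(n + " letters: " + countSteps(letters(n)) + " steps");
        }
    }
}
```

In this model each extra letter doubles the count (4 letters take 16 steps, 10 letters take 1024), which is the T(n + 1) = 2T(n) behaviour described above. A real engine does different bookkeeping per step, so only the growth rate carries over, not the exact numbers.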

Changing \s* to \s+ inside the group, i.e. using (\s+[a-z]+)*, avoids the problem completely, since a repetition can then only start at the boundary between a space character and a letter. In contrast, the first regex allows a repetition to start between any two letters.
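
As a quick sanity check, the following sketch validates a few names with the \s+ variant, including a long invalid input of the kind that hurts the first regex. The full pattern is assumed to be the first regex with only the group changed; FixedPattern, validates and longInvalid are names made up for this example.

```java
import java.util.regex.Pattern;

class FixedPattern {
    // Same as the first regex, except the group uses \s+ instead of \s*.
    static final Pattern FIXED =
        Pattern.compile("[A-Z][a-z]*\\.?\\s+[a-z][a-z]+(\\s+[a-z]+)*");

    // Whole-string validation, as String.matches() would do.
    static boolean validates(String s) {
        return FIXED.matcher(s).matches();
    }

    // "Parus " followed by n letters and an invalid ';'.
    static String longInvalid(int n) {
        StringBuilder sb = new StringBuilder("Parus ");
        for (int i = 0; i < n; i++) sb.append('a');
        return sb.append(';').toString();
    }

    public static void main(String[] args) {
        System.out.println(validates("Tyr. rex"));        // true
        System.out.println(validates("Parus ater ater")); // true
        System.out.println(validates(longInvalid(2000))); // false
    }
}
```

The long invalid input is rejected after the engine gives back the letters one at a time, because at each position the group can only continue if a space follows, and none does.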

To demonstrate this, (\s+[a-z]+\s*)* will give you backtracking hell when the invalid input string contains many words separated by multiple consecutive spaces, since each run of spaces can be split in several ways between the trailing \s* of one repetition and the leading \s+ of the next. The work grows roughly as b^d, where b is roughly the maximum number of consecutive spaces and d is the number of runs of spaces. It is less severe than your first regex (it requires at least one letter and one space character per repetition, as opposed to just one letter per repetition in your first regex), but it is still a problem.
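
The effect of the space runs can be seen in another toy model. As before, this is a hand-written backtracking matcher, not Java's engine: it handles (\s+[a-z]+\s*)* followed by end of input, treats only ' ' as whitespace, and counts how often it is entered. SpaceSplitCount, countSteps and build are hypothetical names for this illustration.

```java
class SpaceSplitCount {
    static long steps;

    static boolean isLetter(char c) { return c >= 'a' && c <= 'z'; }

    // Toy matcher for (\s+[a-z]+\s*)* followed by end of input, starting at pos.
    static boolean matchFrom(String s, int pos) {
        steps++;
        int len = s.length();
        if (pos == len) return true;
        int spEnd = pos;
        while (spEnd < len && s.charAt(spEnd) == ' ') spEnd++;
        for (int a = spEnd; a > pos; a--) {            // \s+ : longest first
            int wEnd = a;
            while (wEnd < len && isLetter(s.charAt(wEnd))) wEnd++;
            for (int b = wEnd; b > a; b--) {           // [a-z]+ : longest first
                int tEnd = b;
                while (tEnd < len && s.charAt(tEnd) == ' ') tEnd++;
                for (int c = tEnd; c >= b; c--) {      // \s* : longest first
                    if (matchFrom(s, c)) return true;
                }
            }
        }
        return false;
    }

    static long countSteps(String s) {
        steps = 0;
        matchFrom(s, 0);
        return steps;
    }

    // `words` one-letter words, each preceded by `spaces` spaces, then an invalid ';'.
    static String build(int spaces, int words) {
        StringBuilder sb = new StringBuilder();
        for (int w = 0; w < words; w++) {
            for (int i = 0; i < spaces; i++) sb.append(' ');
            sb.append('a');
        }
        return sb.append(';').toString();
    }

    public static void main(String[] args) {
        for (int words = 1; words <= 12; words++) {
            System.out.println(words + " words: single spaces " + countSteps(build(1, words))
                + ", double spaces " + countSteps(build(2, words)));
        }
    }
}
```

In this model, single spaces cost only two extra steps per word (6 steps for 3 words, 8 for 4), while double spaces roughly double the count per word (14 steps for 3 words, 30 for 4). That matches the point above: the blow-up needs runs of several spaces, and it grows with the number of runs.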