I am parsing (species) names of the form:
Parus Ater
H. sapiens
T. rex
Tyr. rex
which normally have two terms (binomial) but sometimes have
To address the first part of your question, you should read up on catastrophic backtracking. Essentially, there are too many ways to match your regular expression against your string, and the regex engine keeps backtracking to try each of them.
In your case, it was probably the nested repetition (\s*[a-z]+)*, which likely caused some very strange loops. As Qtax has adeptly pointed out, it's hard to tell without more information.
The second part of your question is, unfortunately, impossible to answer. It's basically the Halting problem. Since a regular expression is essentially a finite state machine whose input is a string, you cannot create a general solution that predicts which regular expressions will backtrack catastrophically and which will not.
As for tips on making your regular expressions run faster, that's a big can of worms. I've spent a lot of time studying regular expressions on my own, and some time optimizing them, and here's what I've found generally helps:
^
for the beginning of the string. See also: Word Boundaries
Hope this helps you. Good luck.
For the first regex:
[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s*[a-z]+)*
The catastrophic backtracking happens due to (\s*[a-z]+)*, as pointed out in the comment. However, this only holds if you are validating the string with String.matches(), since that is the only case where encountering an invalid character causes the engine to backtrack rather than return the match it has found so far (as in a Matcher loop).
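The difference between whole-string validation and a find-style Matcher loop can be sketched in Java roughly as follows. The class name MatchModes, its helper methods, and the sample input are assumptions made for illustration; the regex is the first one from the question, and the input is kept short so that the backtracking stays cheap.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchModes {
    // First regex from the question, with backslashes escaped for a Java string literal.
    static final Pattern FIRST =
            Pattern.compile("[A-Z][a-z]*\\.?\\s+[a-z][a-z]+(\\s*[a-z]+)*");

    // Whole-string validation: an invalid trailing character forces the engine
    // to backtrack through every alternative before it can report failure.
    static boolean validates(String input) {
        return FIRST.matcher(input).matches();
    }

    // Find-style search: returns the first match found, or null if there is none.
    // The engine stops as soon as the pattern is satisfied, so it never backtracks
    // over the invalid character.
    static String firstMatch(String input) {
        Matcher m = FIRST.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "H. sapiens;";           // hypothetical invalid input
        System.out.println(validates(input));   // false
        System.out.println(firstMatch(input));  // H. sapiens
    }
}
```

With a short input both calls return immediately; the slowdown only shows up in validates() once the run of lowercase letters before the invalid character gets long.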
Let us match an invalid string against (\s*[a-z]+)*:
inputstringinputstring;
(Repetition 1)
\s*=(empty)
[a-z]+=inputstringinputstring
FAILED
Backtrack [a-z]+=inputstringinputstrin
(Repetition 2)
\s*=(empty)
[a-z]+=g
FAILED
(End repetition 2 since all choices are exhausted)
Backtrack [a-z]+=inputstringinputstri
(Repetition 2)
\s*=(empty)
[a-z]+=ng
FAILED
Backtrack [a-z]+=n
(Repetition 3)
\s*=(empty)
[a-z]+=g
FAILED
(End repetition 3 since all choices are exhausted)
(End repetition 2 since all choices are exhausted)
Backtrack [a-z]+=inputstringinputstr
By now, you should have noticed the problem. Let us define T(n) as the amount of work needed to check that a string of length n does not match the pattern. From the way backtracking proceeds, we know T(n) = Sum [i = 0..(n-1)] T(i). Writing the same sum for n + 1 gives T(n + 1) = T(n) + Sum [i = 0..(n-1)] T(i) = T(n) + T(n), so we can derive T(n + 1) = 2T(n), which means that the backtracking process is exponential in time complexity!
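As a rough numerical check of that recurrence, the sketch below tabulates T(n) in Java. The class name BacktrackingCost and the base case T(0) = 1 are assumptions (the text above does not fix a base case); the code only evaluates the recurrence and does not measure a real regex engine.

```java
final class BacktrackingCost {
    // Tabulates T(n) = T(0) + T(1) + ... + T(n-1), assuming T(0) = 1.
    static long[] costs(int maxLength) {
        long[] t = new long[maxLength + 1];
        t[0] = 1;                 // assumed base cost for the empty string
        long runningSum = t[0];   // holds T(0) + ... + T(n-1) at the top of each iteration
        for (int n = 1; n <= maxLength; n++) {
            t[n] = runningSum;
            runningSum += t[n];
        }
        return t;
    }

    public static void main(String[] args) {
        // 22 is the number of letters in "inputstringinputstring".
        long[] t = costs(22);
        for (int n = 1; n < t.length; n++) {
            System.out.println("T(" + n + ") = " + t[n]);
        }
        // Last line printed: T(22) = 2097152
    }
}
```

Under that assumed base case every value from T(1) onward is double the previous one, so each extra letter before the invalid character doubles the work.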
Changing \s* to \s+ avoids the problem completely, since an instance of the repetition can then only start at the boundary between a letter and a space character. In contrast, the first regex allows an instance of the repetition to start between any two letters.
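A minimal Java sketch of a validator along these lines is shown below. It assumes the fix is applied to the first regex by turning the \s* inside the repeated group into \s+; the class name SpeciesNames, the method isValid, and the three-term sample "H. sapiens sapiens" are hypothetical and only serve the illustration.

```java
import java.util.regex.Pattern;

final class SpeciesNames {
    // Assumed fix: the first regex with \s* inside the repeated group changed to \s+.
    static final Pattern FIXED =
            Pattern.compile("[A-Z][a-z]*\\.?\\s+[a-z][a-z]+(\\s+[a-z]+)*");

    // Validates the whole string, like String.matches(), but reuses the compiled pattern.
    static boolean isValid(String name) {
        return FIXED.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "H. sapiens",
            "T. rex",
            "Tyr. rex",
            "H. sapiens sapiens",
            // Long run of letters followed by an invalid character.
            "T. inputstringinputstringinputstringinputstring;"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

The last sample is rejected without the exponential blow-up because each extra term must begin with at least one whitespace character, which leaves only one way to split a run of letters.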
To demonstrate this, (\s+[a-z]+\s*)* will give you backtracking hell when the invalid input string contains many words separated by multiple consecutive spaces, since it allows multiple places for a repetition instance to start. This follows the formula b^d, where b is the maximum number of consecutive spaces (minus 1) and d is the number of sequences of spaces. It is less severe than your first regex (it requires at least one English letter and one space character per repetition, as opposed to just one English letter per repetition in your first regex), but it is still a problem.