I have the following REGEX expression (that works) to allow Alpha-Numeric (as well as '
and -
) and no double spacing:
^([a-zA-Z0-9'-]+\s?)*$
Due to the nested grouping, this allows Catastrophic Backtracking to happen - which is bad!
How can I simplify this expression to avoid Catastrophic Backtracking?? (Ideally this wouldn't allow white-space in first and last characters either)
Explanation
Nested group doesn't automatically causes catastrophic backtracking. In your case, it is because your regex degenerates to the classical example of catastrophic backtracking (a*)*
.
Since \s
in optional in ^([a-zA-Z0-9'-]+\s?)*$
, on input without any spaces but has characters outside the allowed list, the regex simply degenerates to ^([a-zA-Z0-9'-]+)*$
.
You can also think in term of expansion of the original regex:
[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?...
Since \s
is optional, we can remove it:
[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+...
And we got a series of consecutive [a-zA-Z0-9'-]+
, which will try all ways to distribute the characters between themselves and blow up the complexity.
Solution
The standard way to write a regex to match token delimiter token ... delimiter token
is token (delimiter token)*
. While it is possible to rewrite the regex avoid repeating token
, I'd recommend against it, since it is harder to get it right. To avoid repetition , you might want to construct the regex by string concatenation instead.
Following the recipe above:
^[a-zA-Z0-9'-]+(\s[a-zA-Z0-9'-]+)*$
Although you can see repetition in repetition here, there is no catastrophic backtracking, since the regex can only expand to:
[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+...
And \s
and [a-zA-Z0-9'-]
are mutual exclusive - there is only one way to match any string.
来源:https://stackoverflow.com/questions/27237579/simple-alphanumeric-regex-single-spacing-without-catastrophic-backtracking