The Complexity of (simplified) Regex Matching

Question

I'm wondering about the complexity of the following regex matching problem: given a string of lowercase letters and a matching rule, determine whether the rule matches the WHOLE string. The rule is a simplified regex containing only lowercase letters, '.' (period), and '*' (asterisk). A period matches any single lowercase letter, while an asterisk matches zero or more of the preceding element.

Here are some examples:

isMatch("aa", "a") is false
isMatch("aa", "aa") is true
isMatch("aaa", "aa") is false
isMatch("aa", "a*") is true
isMatch("aa", ".*") is true
isMatch("ab", ".*") is true
isMatch("aab", "c*a*b") is true

It is said that this problem can be solved in polynomial time, and I wonder how. Intuitively, matching "aaaaaaaaaa" against a regex like ".*a.*" makes it hard to decide which state transition to take when matching with a deterministic finite automaton. Any comments?

Thank you.

Answer 1:

You can solve this in polynomial time by using a dynamic programming algorithm. The idea is to answer queries of the following form:

Can you match the last m characters of the string using the last n characters of the regular expression?

Start from a recursive algorithm, then either memoize it or fill in a dynamic programming table bottom-up, so that each query is answered only once. Writing "the string" and "the regular expression" for the current suffixes, the recursion works as follows:

If the regular expression is empty, it only matches the empty string.
If the regular expression's second character isn't * (or the regex has only one character), then the regular expression matches the string iff the string is nonempty, its first character matches the first character of the regex, and the rest of the string matches the rest of the regex.
If the regular expression's second character is *, then the regular expression matches the string iff at least one of the following is true:
- The regex formed by removing the *'ed expression matches the whole string (the *'ed expression matches zero characters). This option must be tried whether or not the first characters match; for example, "a" matches "a*a" only this way.
- The string is nonempty, the first character of the regular expression matches the first character of the string, and the same regular expression matches the remainder of the string (the *'ed expression consumes one more character).
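
The recursion can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the names `is_match` and `match` are made up here, suffixes are represented by indices `i` and `j` rather than by copied strings, and the pattern is assumed to be well formed (no leading '*' and no "**"). Note that the sketch tries the zero-occurrence branch whether or not the first characters match, so that inputs like "a" against "a*a" succeed.

```python
from functools import lru_cache


def is_match(s: str, p: str) -> bool:
    """Return True if the simplified regex p matches the WHOLE string s."""

    @lru_cache(maxsize=None)  # memoize: each (i, j) pair is computed once
    def match(i: int, j: int) -> bool:
        # Does the regex suffix p[j:] match the string suffix s[i:]?
        if j == len(p):
            # An empty regex matches only the empty string.
            return i == len(s)

        # Does the first regex character match the first string character?
        first = i < len(s) and p[j] in (s[i], ".")

        if j + 1 < len(p) and p[j + 1] == "*":
            # Either drop "x*" (zero occurrences), or consume one
            # character of the string and keep the same regex.
            return match(i, j + 2) or (first and match(i + 1, j))

        # No star: consume one character from both string and regex.
        return first and match(i + 1, j + 1)

    return match(0, 0)
```

For example, `is_match("aab", "c*a*b")` returns True and `is_match("aaa", "aa")` returns False, in line with the examples in the question. Being recursive, the sketch is limited by Python's recursion depth on very long inputs.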

Each of these cases makes recursive calls in which either the string or the regex is shorter, and does O(1) work aside from those calls. Since there are only Θ(mn) possible subproblems (one for each combination of a suffix of the regex and a suffix of the original string, where m is the length of the string and n is the length of the regex), memoization solves this problem in O(mn) time.
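
To make the subproblem count concrete, here is a bottom-up variant that fills one table entry per (string suffix, regex suffix) pair. It is only a sketch under the same assumptions as the recursion described above: the name `is_match_table` is hypothetical, the pattern is assumed to be well formed, and the zero-occurrence option for a starred element is always tried.

```python
def is_match_table(s: str, p: str) -> bool:
    """Bottom-up check that the simplified regex p matches all of s."""
    m, n = len(s), len(p)

    # dp[i][j] is True when the regex suffix p[j:] matches s[i:].
    # The table has (m + 1) * (n + 1) entries, each filled in O(1).
    dp = [[False] * (n + 1) for _ in range(m + 1)]
    dp[m][n] = True  # empty regex matches empty string

    # Fill from the shortest suffixes toward the full string and regex.
    for i in range(m, -1, -1):
        for j in range(n - 1, -1, -1):
            first = i < m and p[j] in (s[i], ".")
            if j + 1 < n and p[j + 1] == "*":
                # Zero occurrences, or consume one character of s.
                dp[i][j] = dp[i][j + 2] or (first and dp[i + 1][j])
            else:
                dp[i][j] = first and dp[i + 1][j + 1]

    return dp[0][0]
```

The two nested loops visit each table entry exactly once, which is where the mn bound comes from. The input from the question, `is_match_table("aaaaaaaaaa", ".*a.*")`, needs no backtracking here because every suffix pair is resolved once.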

Hope this helps!

Source: https://stackoverflow.com/questions/15187888/the-complexity-of-simplified-regex-matching

Tags

regex

algorithm

time-complexity