Complexity of Regex substitution

傲寒 · asked 2020-12-01 16:58

I haven't found the answer to this anywhere: what is the runtime complexity of a regex match and substitution?

Edit: I work in Python, but would like to know in general.

8 Answers
  •  鱼传尺愫
    2020-12-01 17:07

    Other theoretical info of possible interest.

    For clarity, assume the standard definition for a regular expression

    http://en.wikipedia.org/wiki/Regular_language

    from formal language theory. Practically, this means that the only building materials are alphabet symbols, the operators of concatenation, alternation, and Kleene closure, along with the unit and zero constants (which appear for group-theoretic reasons). Generally it's a good idea not to overload this term despite the everyday practice in scripting languages, which leads to ambiguities.

    There is an NFA construction that solves the matching problem for a regular expression r and an input text t in O(|r| |t|) time and O(|r|) space, where |-| is the length function. This algorithm was further improved by Myers

    http://doi.acm.org/10.1145/128749.128755

    to the time and space complexity O(|r| |t| / log |t|) by using automaton node listings and the Four Russians paradigm. This paradigm seems to be named after four Russian guys who wrote a groundbreaking paper which is not online. However, the paradigm is illustrated in these computational biology lecture notes

    http://lyle.smu.edu/~saad/courses/cse8354/lectures/lecture5.pdf

    I find it hilarious to name a paradigm by the number and the nationality of authors instead of their last names.
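    The O(|r| |t|) bound above comes from simulating the NFA with a set of active states, which can equivalently be tabulated as dynamic programming. Here is a minimal Python sketch of that tabulated form, restricted to literals, '.' (any character), and '*' applied to a single preceding atom — no alternation or groups, so it only illustrates the complexity bound, not a full engine:

    ```python
    def match(pattern, text):
        """O(|r| * |t|) time and space matching via dynamic programming.

        dp[i][j] is True iff pattern[i:] matches text[j:]. Supports
        literals, '.' (any char), and '*' (zero-or-more of the prior atom).
        """
        m, n = len(pattern), len(text)
        dp = [[False] * (n + 1) for _ in range(m + 1)]
        dp[m][n] = True  # empty pattern matches empty text

        for i in range(m - 1, -1, -1):
            for j in range(n, -1, -1):
                # Does the current pattern atom match the current text char?
                first = j < n and pattern[i] in (text[j], '.')
                if i + 1 < m and pattern[i + 1] == '*':
                    # Either skip the starred atom, or consume one text char
                    # and stay on the same pattern position.
                    dp[i][j] = dp[i + 2][j] or (first and dp[i][j + 1])
                else:
                    dp[i][j] = first and dp[i + 1][j + 1]
        return dp[0][0]
    ```

    Each of the O(|r| |t|) table cells is filled in constant time, which is exactly the per-step cost of the set-of-states NFA simulation.
    
    
    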

    The matching problem for regular expressions with added backreferences is NP-complete, which was proven by Aho

    http://portal.acm.org/citation.cfm?id=114877

    by a reduction from the vertex-cover problem which is a classical NP-complete problem.

    To match regular expressions with backreferences deterministically we could employ backtracking (not unlike the Perl regex engine) to keep track of the possible subwords of the input text t that can be assigned to the variables in r. There are only O(|t|^2) subwords that can be assigned to any one variable in r. If there are n variables in r, then there are O(|t|^(2n)) possible assignments. Once an assignment of substrings to variables is fixed, the problem reduces to plain regular expression matching. Therefore the worst-case complexity for matching regular expressions with backreferences is O(|t|^(2n)).
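    For a single variable, that enumeration strategy can be sketched directly: try each of the O(|t|^2) candidate substrings, substitute it into the pattern, and delegate to a plain (backreference-free) matcher. The `{v}` placeholder and the function name here are purely illustrative, not any real library's API:

    ```python
    import re

    def match_one_backref(template, text):
        """Match a pattern with one backreference by brute force.

        `template` is a plain regular expression containing the
        placeholder '{v}' wherever the variable occurs. We enumerate
        all O(|t|^2) substrings of `text` as candidate values for the
        variable, substitute (escaped) into the template, and test with
        an ordinary regex fullmatch. Returns the first value that works,
        or None.
        """
        for i in range(len(text)):
            for j in range(i, len(text) + 1):
                v = text[i:j]
                plain = template.replace('{v}', re.escape(v))
                if re.fullmatch(plain, text):
                    return v
        return None
    ```

    With n variables this nests n such loops, giving the O(|t|^(2n)) assignment count described above, times the cost of the plain match for each assignment.
    
    
    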

    Note however, regular expressions with backreferences are not yet full-featured regexen.

    Take, for example, the "don't care" symbol on its own, apart from any other operators. There are several polynomial algorithms deciding whether a set of such patterns matches an input text. For example, Kucherov and Rusinowitch

    http://dx.doi.org/10.1007/3-540-60044-2_46

    define a pattern as a word w_1@w_2@...@w_n where each w_i is a word (not a regular expression) and "@" is a variable length "don't care" symbol not contained in either of w_i. They derive an O((|t| + |P|) log |P|) algorithm for matching a set of patterns P against an input text t, where |t| is the length of the text, and |P| is the length of all the words in P.
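    A single pattern of this shape can be matched by a much simpler greedy scan: take the earliest occurrence of each w_i in turn, which never rules out a match for the later words. This is an O(|t| * n) sketch (allowing gaps before w_1 and after w_n), not the O((|t| + |P|) log |P|) set-of-patterns algorithm from the paper:

    ```python
    def matches_dontcare(words, text):
        """Match the pattern w_1@w_2@...@w_n against `text`, where '@'
        is a variable-length "don't care" gap.

        Greedily locate each word at its earliest position after the
        previous word's end; occurrences must not overlap. Returns True
        iff the whole pattern matches somewhere in `text`.
        """
        pos = 0
        for w in words:
            pos = text.find(w, pos)
            if pos == -1:
                return False
            pos += len(w)  # next word must start after this one ends
        return True
    ```

    The greedy choice is safe because moving any w_i further right only shrinks the room left for w_{i+1}, ..., w_n.
    
    
    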

    It would be interesting to know how these complexity measures combine and what is the complexity measure of the matching problem for regular expressions with backreferences, "don't care" and other interesting features of practical regular expressions.

    Alas, I haven't said a word about Python... :)
