PHP PREG_JIT_STACKLIMIT_ERROR - inefficient regex

予麋鹿 2020-12-19 10:12

I am getting a PREG_JIT_STACKLIMIT_ERROR in preg_replace_callback() when working with longer strings. Above 2000 characters it stops working.

2 answers
  •  渐次进展
    2020-12-19 10:31

    What is PCRE JIT?

    Just-in-time compiling is a heavyweight optimization that can greatly speed up pattern matching. However, it comes at the cost of extra processing before the match is performed. Therefore, it is of most benefit when the same pattern is going to be matched many times.

    And how does it work, basically?

    PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack where the local data of the current node is pushed before checking its child nodes... When the compiled JIT code runs, it needs a block of memory to use as a stack. By default, it uses 32K on the machine stack. However, some large or complicated patterns need more than this. The error PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack.

    As the first quote explains, JIT is an optional feature, and it is on by default in the PCRE bundled with PHP 7.*. You can turn it off with pcre.jit = 0, although that is not recommended.
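
    As a minimal sketch of the setting mentioned above, assuming the script is allowed to change ini values at runtime, pcre.jit can be switched off before the first preg_* call. The same setting can also go in php.ini as pcre.jit=0. The sample pattern and subject are made up for illustration.

    ```php
    <?php
    // Turn the PCRE JIT off for this request, before any pattern is compiled.
    // ini_set() returns false if the setting is unavailable on this build.
    $old = ini_set('pcre.jit', '0');

    // Patterns still work without JIT; they just run in the interpreter.
    $matched = preg_match('/\w+-\w+/', 'if-condition');

    var_dump($matched);
    ```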

    That said, when a preg_* function fails and preg_last_error() returns error code 6 (PREG_JIT_STACKLIMIT_ERROR), it most likely means the JIT ran into its stack size limit.
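
    To make that failure visible instead of silently getting null back, the call can be wrapped in a small check. This is only a sketch: replace_or_fail() is a hypothetical helper name, not a PHP built-in, and the sample pattern is made up.

    ```php
    <?php
    // Hypothetical wrapper (not a PHP built-in): fail loudly when
    // preg_replace_callback() returns null because of a PCRE error.
    function replace_or_fail(string $pattern, callable $callback, string $subject): string
    {
        $result = preg_replace_callback($pattern, $callback, $subject);

        if ($result === null) {
            $code = preg_last_error();
            $hint = $code === PREG_JIT_STACKLIMIT_ERROR
                ? 'JIT stack limit reached (PREG_JIT_STACKLIMIT_ERROR)'
                : 'PCRE error code ' . $code;
            throw new RuntimeException($hint);
        }

        return $result;
    }

    // Usage: upper-case every word of a short subject.
    echo replace_or_fail('/\w+/', fn(array $m) => strtoupper($m[0]), 'a bit longer string'), "\n";
    ```

    Throwing here is a design choice for the sketch; logging the code and falling back to another strategy would work just as well.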

    Capturing groups consume more memory than non-capturing groups (and the quantifier applied to a group can raise the cost further):

    1. Capturing group OP_CBRA (pcre_jit_compile.c:#1138); actual memory use is higher than this:
    case OP_CBRA:
    case OP_SCBRA:
      bracketlen = 1 + LINK_SIZE + IMM2_SIZE;
      break;
    
    2. Non-capturing group OP_BRA (pcre_jit_compile.c:#1134); actual memory use is higher than this:
    case OP_BRA:
      bracketlen = 1 + LINK_SIZE;
      break;
    

    Changing the capturing groups in your RegEx to non-capturing groups therefore makes it produce the proper output (I don't know exactly how much memory that saves).
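
    As a small illustration of the difference, with a made-up pattern and subject rather than the one from the question: (?:...) groups the same way as (...) but stores nothing, so the overall match is identical while the result array is smaller.

    ```php
    <?php
    $subject = 'key:value';

    // Capturing groups: every (...) is stored in the result array.
    preg_match('/(\w+):(\w+)/', $subject, $m);
    $capturingCount = count($m);      // full match plus two groups

    // Non-capturing groups: (?:...) only groups, nothing is stored.
    preg_match('/(?:\w+):(?:\w+)/', $subject, $n);
    $nonCapturingCount = count($n);   // full match only

    printf("capturing: %d entries, non-capturing: %d entries\n", $capturingCount, $nonCapturingCount);

    // Both patterns match exactly the same text.
    var_dump($m[0] === $n[0]);
    ```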

    But it seems you need the capturing groups. In that case you should rewrite your RegEx for performance. Backtracking is nearly always the first thing to consider when optimizing a RegEx.

    Update #1

    Solution:

    (?(DEFINE)
      (?<recurs>
        (?! {@|@} ) [^|] [^{@|\\]* ( \\.[^{@|\\]* )* | (?R)
      )
    )
    {@
    (? \w+)-
    (? (%?\w++ (:\w+)*)* )
    (? [|] [^{@|]*+ (?&recurs)* )
    (? [|] (?&recurs)* )?
    \s*@}
    


    PHP code (watch backslash escaping):

    preg_match_all('/(?(DEFINE)
      (?<recurs>
        (?! {@|@} ) [^|] [^{@|\\\\]* ( \\\\.[^{@|\\\\]* )* | (?R)
      )
    )
    {@
    (? \w+ )-
    (? (%?\w++ (:\w+)*)* )
    (? [|] [^{@|]*+ (?&recurs)* )
    (? [|] (?&recurs)* )?
    \s*@}/x', $string, $matches);
    

    This is your own RegEx, optimized to take the fewest backtracking steps. Whatever your original pattern was supposed to match is matched by this one too.
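
    For a quick local check, here is a minimal runnable sketch of the pattern above. Only the group name recurs is implied by the (?&recurs) calls. The names name, opts, yes and no, as well as the sample input, are assumptions made for illustration and not the original names.

    ```php
    <?php
    // Group names name/opts/yes/no are placeholders chosen for this sketch;
    // only "recurs" is confirmed by the (?&recurs) calls in the pattern.
    $pattern = '/(?(DEFINE)
      (?<recurs>
        (?! {@|@} ) [^|] [^{@|\\\\]* ( \\\\.[^{@|\\\\]* )* | (?R)
      )
    )
    {@
    (?<name> \w+ )-
    (?<opts> (%?\w++ (:\w+)*)* )
    (?<yes> [|] [^{@|]*+ (?&recurs)* )
    (?<no> [|] (?&recurs)* )?
    \s*@}/x';

    // Made-up input in the {@name-options|then|else@} shape.
    $string = 'before {@if-cond|yes|no@} after';

    $count = preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);

    if ($count === false) {
        // Error code 6 is PREG_JIT_STACKLIMIT_ERROR.
        echo 'PCRE error code: ', preg_last_error(), "\n";
    } else {
        foreach ($matches as $m) {
            echo $m[0], ' => name=', $m['name'], ' opts=', $m['opts'], "\n";
        }
    }
    ```

    PREG_SET_ORDER is used here only so that each match arrives as one array, which keeps the loop short.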

    RegEx that does not follow nested if blocks:

    {@
    (? \w+)-
    (? (%?\w++ (:\w+)*)* )
    (? [|] [^|\\]* (?: \\.[^|\\]* )* )
    (? [|] \X*)?
    @}
    


    Most of the quantifiers are written possessively by appending + to them, which prevents backtracking into them.
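
    The effect of a possessive quantifier is easiest to see on a tiny case where backtracking is required for a match. The patterns below are made up for illustration and are not part of the solution above.

    ```php
    <?php
    // Greedy \w+ first takes "aaa", then gives one character back so the
    // trailing "a" can match.
    $greedy = preg_match('/^\w+a$/', 'aaa');

    // Possessive \w++ also takes "aaa" but never gives anything back,
    // so the trailing "a" has nothing left to match.
    $possessive = preg_match('/^\w++a$/', 'aaa');

    var_dump($greedy, $possessive);
    ```

    In the solution, the possessive parts are followed by characters they could never have matched anyway, so the result stays the same and only the wasted backtracking disappears.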
