问题

I was testing two almost identical regexes against a string (on regex101.com), and I noticed that there was a huge difference in the number of steps that they were taking. Here are the two regexes:

(Stake: £)(\d+(?:\.\d+)?)

(winnings: £)(\d+(?:\.\d+)?)

This is the string I was running them against (with modifiers g, i, m, u):

Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...

_{side note: the string/regexes are not mine, I just took them from elsewhere on SE and saw this behavior happening. It's not relevant to this question, but I figured it needed attribution.}

(Note: Regex101 seems to be backed by PHP's PCRE regexes. The "optimizations off" setting is probably PCRE_NO_START_OPTIMIZE | PCRE_NO_AUTO_POSSESS. Thanks to Lucas Trzesniewski for figuring this out and writing some C# code to test it with.)

With optimizations on: the first one takes 304 steps to match, while the second one takes 21 steps. This is the huge discrepancy I am wondering about.

With optimizations off: the first one takes 333 steps to match, while the second one takes 317 steps. This would indicate that the first pattern is failing to be optimized, but not the second one.

Interestingly enough, without the u modifier, the first pattern (with optimizations on) only takes 40 steps. (But it doesn't change the performance of anything else, nor what the regex eventually matches.)

I'd like to know what in regex101's optimization engine (PCRE) is causing this discrepancy in the amount of steps. I am specifically looking for the reason that the longer regex takes less steps when unicode is enabled.

Is there a reason why this is happening or is this a bug in the engine?

回答1:

Regex101 uses the PCRE library (by default), so its debugger is tied to PCRE's behavior. The PCRE library supports an auto-callout option through its PCRE_AUTO_CALLOUT flag, which invokes a callback function at every step of matching. I'm 99.99% sure that's how regex101's debugger works. Each callout invocation registers as a step in the debugger.

Testing with regex101

Now, take a look at the different steps involved, first, without the u option:

Notice how the text cursor jumps from the Start Game part directly to the Stake part.

What happens when you add u?

Notice the jump doesn't occur anymore, and this is where the additional steps come from.

What does the u option do? It sets PCRE_UTF8 | PCRE_UCP - yes, both of those at once, and this PCRE_UCP flag is important here.

Here's what the pcreunicode docs say:

Case-insensitive matching applies only to characters whose values are less than 128, unless PCRE is built with Unicode property support. A few Unicode characters such as Greek sigma have more than two codepoints that are case-equivalent. Up to and including PCRE release 8.31, only one-to-one case mappings were supported, but later releases (with Unicode property support) do treat as case-equivalent all versions of characters such as Greek sigma.

Because of the additional complexity in handling case insensitivity in Unicode mode, the engine cannot just skip all of the text. To verify this is the culprit, let's try this with the gmu flags (that is, with u but without i):

The optimization is applied even with u, and this pretty much confirms the hypothesis.

Observations

The slow path you're seeing looks just like if PCRE_NO_START_OPTIMIZE had been used (and this is probably part of what regex101's disable internal engine optimizations does, along with PCRE_NO_AUTO_POSSESS, which is not relevant here).

More about it from the docs:

There are a number of optimizations that pcre_exec() uses at the start of a match, in order to speed up the process. For example, if it is known that an unanchored match must start with a specific character, it searches the subject for that character, and fails immediately if it cannot find it, without actually running the main matching function.

For some reason (which will become apparent later), PCRE fails to register an s or k letter as a required starting character, and cannot use the anchoring optimization. All the other letters of the latin alphabet work just fine in this regard.
This is why Stake requires more steps than winnings: the engine just doesn't skip the checks.

Testing with PCRE directly

Here's a test program I used with PCRE 8.38, which is the latest PCRE1 version available as of this post.

#include <stdio.h>
#include <string.h>
#include <pcre.h>

static int calloutCount;

static int callout_handler(pcre_callout_block *c) {
    ++calloutCount;
    return 0;
}

static void test_run(const char* pattern, pcre* re, pcre_extra* extra) {
    int rc, startOffset;
    int ovector[3 * 3];
    
    pcre_callout = callout_handler;
    calloutCount = 0;
    startOffset = 0;
    
    const char *subject = "Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...";
    
    for (;;) {
        rc = pcre_exec(re, extra, subject, strlen(subject), startOffset, 0, ovector, sizeof(ovector) / sizeof(int));
        if (rc < 0)
            break;
        startOffset = ovector[1];
    }
    
    printf("%-30s %s => %i\n", pattern, extra ? "(studied)    " : "(not studied)", calloutCount);
}

static void test(const char* pattern) {
    pcre *re;
    const char *error;
    int erroffset;
    pcre_extra *extra;
    
    re = pcre_compile(pattern, PCRE_AUTO_CALLOUT | PCRE_CASELESS | PCRE_MULTILINE | PCRE_UTF8 | PCRE_UCP, &error, &erroffset, 0);             
    if (re == 0)
        return;
    
    extra = pcre_study(re, 0, &error);
    
    test_run(pattern, re, extra);
    
    if (extra)
        test_run(pattern, re, NULL);
    
    pcre_free_study(extra); 
    pcre_free(re);
}

int main(int argc, char **argv) {   
    printf("PCRE version: %s\n\n", pcre_version());

    test("(Stake: £)(\\d+(?:\\.\\d+)?)");
    test("(winnings: £)(\\d+(?:\\.\\d+)?)");
    
    return 0;
}

The output I got is the following:

PCRE version: 8.38 2015-11-23

(Stake: £)(\d+(?:\.\d+)?)     (studied)     => 40
(Stake: £)(\d+(?:\.\d+)?)     (not studied) => 302
(winnings: £)(\d+(?:\.\d+)?)  (studied)     => 21
(winnings: £)(\d+(?:\.\d+)?)  (not studied) => 21

Here, we can see that studying the pattern makes a difference in the first case, but not in the second one.

Studying a pattern means the following:

Studying a pattern does two things: first, a lower bound for the length of subject string that is needed to match the pattern is computed. This does not mean that there are any strings of that length that match, but it does guarantee that no shorter strings match. The value is used to avoid wasting time by trying to match strings that are shorter than the lower bound. You can find out the value in a calling program via the pcre_fullinfo() function.

Studying a pattern is also useful for non-anchored patterns that do not have a single fixed starting character. A bitmap of possible starting bytes is created. This speeds up finding a position in the subject at which to start matching. (In 16-bit mode, the bitmap is used for 16-bit values less than 256. In 32-bit mode, the bitmap is used for 32-bit values less than 256.)

From the results and the docs description, you can conclude that PCRE considers the S character is not an anchoring character, and in caseless Unicode mode, it requires a bitmap to be created. The bitmap allows the optimization to be applied.

Now, here's the PCRE2 version, compiled against PCRE2 v10.21, which is the latest release as of this post. The results will be unsurprising as PCRE2 always studies the patterns you supply it, no questions asked.

#include <stdio.h>
#include <string.h>

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

static int callout_handler(pcre2_callout_block *c, void *data) {
    ++*((int*)data);
    return 0;
}

static void test(const char* pattern) {
    pcre2_code *re;
    int error;
    PCRE2_SIZE erroffset;
    pcre2_match_context *match_context;
    pcre2_match_data *match_data;
    int rc, startOffset = 0;
    int calloutCount = 0;
    PCRE2_SIZE *ovector;
    
    const PCRE2_SPTR subject = (PCRE2_SPTR)"Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...";
    
    re = pcre2_compile((PCRE2_SPTR)pattern, PCRE2_ZERO_TERMINATED, PCRE2_AUTO_CALLOUT | PCRE2_CASELESS | PCRE2_MULTILINE | PCRE2_UTF | PCRE2_UCP, &error, &erroffset, 0);
    if (re == 0)
        return;
    
    match_context = pcre2_match_context_create(0);
    pcre2_set_callout(match_context, callout_handler, &calloutCount);
    
    match_data = pcre2_match_data_create_from_pattern(re, 0);
    ovector = pcre2_get_ovector_pointer(match_data);
    
    for (;;) {
        rc = pcre2_match(re, subject, PCRE2_ZERO_TERMINATED, startOffset, 0, match_data, match_context);
        if (rc < 0)
            break;
        startOffset = ovector[1];
    }
    
    printf("%-30s => %i\n", pattern, calloutCount);
    
    pcre2_match_context_free(match_context);
    pcre2_match_data_free(match_data);
    pcre2_code_free(re);
}

int main(int argc, char **argv) {
    char version[256];
    pcre2_config(PCRE2_CONFIG_VERSION, &version);
    printf("PCRE version: %s\n\n", version);
    
    test("(Stake: £)(\\d+(?:\\.\\d+)?)");
    test("(winnings: £)(\\d+(?:\\.\\d+)?)");
    
    return 0;
}

And here's the result:

PCRE version: 10.21 2016-01-12

(Stake: £)(\d+(?:\.\d+)?)     => 40
(winnings: £)(\d+(?:\.\d+)?)  => 21

Yup. Same results as in PCRE1 when studying was used.

Implementation details

Let's take a look at the implementation details, shall we? PCRE compiles a pattern to opcodes, which we can read with the pcretest program.

Here's the input file:

/(Stake: £)(\d+(?:\.\d+)?)/iDW8
Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...

/(winnings: £)(\d+(?:\.\d+)?)/iDW8
Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...

Its format is simple: A pattern in delimiters followed by options, and one or more input strings.

The options are:

i - PCRE_CASELESS
D - turn on debug mode: show the opcodes and pattern information
W - PCRE_UCP
8 - PCRE_UTF8

And the result is...

PCRE version 8.38 2015-11-23

/(Stake: £)(\d+(?:\.\d+)?)/iDW8
------------------------------------------------------------------
  0  55 Bra
  3  24 CBra 1
  8     clist 0053 0073 017f
 11  /i ta
 15     clist 004b 006b 212a
 18  /i e: \x{a3}
 27  24 Ket
 30  22 CBra 2
 35     prop Nd ++
 39     Brazero
 40   9 Bra
 43  /i .
 45     prop Nd ++
 49   9 Ket
 52  22 Ket
 55  55 Ket
 58     End
------------------------------------------------------------------
Capturing subpattern count = 2
Options: caseless utf ucp
No first char
Need char = ' '
Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...
 0: Stake: \x{a3}2.00
 1: Stake: \x{a3}
 2: 2.00

/(winnings: £)(\d+(?:\.\d+)?)/iDW8
------------------------------------------------------------------
  0  60 Bra
  3  29 CBra 1
  8  /i winning
 22     clist 0053 0073 017f
 25  /i : \x{a3}
 32  29 Ket
 35  22 CBra 2
 40     prop Nd ++
 44     Brazero
 45   9 Bra
 48  /i .
 50     prop Nd ++
 54   9 Ket
 57  22 Ket
 60  60 Ket
 63     End
------------------------------------------------------------------
Capturing subpattern count = 2
Options: caseless utf ucp
First char = 'w' (caseless)
Need char = ' '
Start Game, Credit: £200.00game num: 1, Stake: £2.00Spinning Reels:NINE SEVEN KINGKING STAR ACEQUEEN JACK KINGtotal winnings: £0.00End Game, Credit: £198Start...
 0: winnings: \x{a3}0.00
 1: winnings: \x{a3}
 2: 0.00

Now this gets interesting!

First, we see the first pattern gets No first char, while the second one gets First char = 'w' (caseless).

The S is identified as: clist 0053 0073 017f, while the k is turned into clist 004b 006b 212a. Those are the sets of matching codepoints for these letters. What do they represent?

S: 0053 = S, 0073 = s, 017f = ſ (Latin Small Letter Long S) - Yikes.
k: 004b = K, 006b = k, 212a = K (Kelvin Sign) - Double yikes.

The optimization is not applied because case folding for these two letters include characters outside the ASCII range.

Now you see :)

The solution for PHP

So, what can you do from PHP? You add the S modifier:

When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

Regex101 doesn't support pattern studying AFAIK.

回答2:

As Lucas Trzesniewski pointed out, this behavior only happens with 2 letters within a-z: k and s. Which both occur in the first pattern, coincidentally.

I was wondering why I didn't see any improvement when I shortened the regex to (ke: £)(\d+(?:\.\d+)?), and that was why.

But I think I found the answer. I began digging through the PCRE source code, picking random files that seemed like they might be of interest. And I found pcre2_udc.c:

The tables herein are needed only when UCP support is built, and in PCRE2 that happens automatically with UTF support.

A little further down the page, I found this list of unicode points. I have put the corresponding Unicode character next to its hexadecimal value for quick reference:

const uint32_t PRIV(ucd_caseless_sets)[] = {
    NOTACHAR,
    0x0053, S 0x0073, s 0x017f, ſ NOTACHAR,
    0x01c4, Ǆ 0x01c5, ǅ 0x01c6, ǆ NOTACHAR,
    0x01c7, Ǉ 0x01c8, ǈ 0x01c9, ǉ NOTACHAR,
    0x01ca, Ǌ 0x01cb, ǋ 0x01cc, ǌ NOTACHAR,
    0x01f1, Ǳ 0x01f2, ǲ 0x01f3, ǳ NOTACHAR,
    0x0345, ͅ  0x0399, Ι 0x03b9, ι 0x1fbe,  ι NOTACHAR,
    0x00b5, µ 0x039c, Μ 0x03bc, μ NOTACHAR,
    0x03a3, Σ 0x03c2, ς 0x03c3, σ NOTACHAR,
    0x0392, Β 0x03b2, β 0x03d0, ϐ NOTACHAR,
    0x0398, Θ 0x03b8, θ 0x03d1, ϑ 0x03f4,  ϴ NOTACHAR,
    0x03a6, Φ 0x03c6, φ 0x03d5, ϕ NOTACHAR,
    0x03a0, Π 0x03c0, π 0x03d6, ϖ NOTACHAR,
    0x039a, Κ 0x03ba, κ 0x03f0, ϰ NOTACHAR,
    0x03a1, Ρ 0x03c1, ρ 0x03f1, ϱ NOTACHAR,
    0x0395, Ε 0x03b5, ε 0x03f5, ϵ NOTACHAR,
    0x1e60, Ṡ 0x1e61, ṡ 0x1e9b, ẛ NOTACHAR,
    0x03a9, Ω 0x03c9, ω 0x2126, Ω NOTACHAR,
    0x004b, K 0x006b, k 0x212a, K NOTACHAR,
    0x00c5, Å 0x00e5, å 0x212b, Å NOTACHAR,
}

As a result, this regex:

(ſtake: £)(\d+(?:\.\d+)?)

is equivalent to my first regex.

Again, it was Lucas who put me on the right direction with the link to the the pcreunicode docs:

Case-insensitive matching applies only to characters whose values are less than 128, unless PCRE is built with Unicode property support. A few Unicode characters such as Greek sigma have more than two codepoints that are case-equivalent. Up to and including PCRE release 8.31, only one-to-one case mappings were supported, but later releases (with Unicode property support) do treat as case-equivalent all versions of characters such as Greek sigma.

The engine was failing to optimize things when any of these characters are present.

But looking within pcre2_study.c, I marked (with /**/) the line where the optimizations may happen during studying:

      case OP_PROP:
      if (tcode[1] != PT_CLIST) return SSB_FAIL;
        {
        const uint32_t *p = PRIV(ucd_caseless_sets) + tcode[2];
/**/    while ((c = *p++) < NOTACHAR)
          {
#if defined SUPPORT_UNICODE && PCRE2_CODE_UNIT_WIDTH == 8
          if (utf)
            {
            PCRE2_UCHAR buff[6];
            (void)PRIV(ord2utf)(c, buff);
            c = buff[0];
            }
#endif
          if (c > 0xff) SET_BIT(0xff); else SET_BIT(c);
          }
        }
      try_next = FALSE;
      break;

来源：https://stackoverflow.com/questions/65116542/why-do-atomic-groups-and-non-capturing-seem-to-add-many-steps

标签

regex