Why pcre regex is much faster than c++11 regex

天大地大妈咪最大 提交于 2020-01-04 04:55:07

问题


Some sample code. This is the c++11 part using cregex_iterator:

std::chrono::steady_clock::time_point begin0 = std::chrono::steady_clock::now();
regex re("<option[\\s]value[\\s]*=[\\s]*\"([^\">]*)\"[\\s]*[^>]*>", regex::icase);
int found = 0;
for (std::cregex_iterator i = std::cregex_iterator(input, input + input_length, re);
i != std::cregex_iterator();
    ++i)
{
    found++;
    if (found < 10000) continue;
    break;
}
std::chrono::steady_clock::time_point end0 = std::chrono::steady_clock::now();

This is the pcre part. The regexp is all the same.

std::chrono::steady_clock::time_point begin4 = std::chrono::steady_clock::now();
const char *pError = NULL;
int errOffset;
int options = PCRE_MULTILINE | PCRE_CASELESS;
const char* regexp = "<option[\\s]value[\\s]*=[\\s]*\"([^\">]*)\"[\\s]*[^>]*>";
pcre* pPcre = pcre_compile(regexp, options, &pError, &errOffset, 0);                
int offset = 0;
int matches = -1;
int pMatches[6];
while (offset < input_length)
{
    matches = pcre_exec(pPcre,NULL, input, input_length, offset,0, pMatches,6); 
    if (matches >= 1)
    {
        found++;
        offset = pMatches[1];
        if (found < 10000) continue;
        break;  // find match
    }
    else
        offset = input_length;
}

std::chrono::steady_clock::time_point end4 = std::chrono::steady_clock::now();

The result shows pcre is 100 times faster than c++11. I found some vector copy and resize in c++11 implementation. Are there some other reasons?


回答1:


PCRE benefits from some optimizations known as start-up optimizations which are configured to be enabled by default. These optimizations include:

  1. A subject pre-scan for unanchored patterns (if a starting point is not found engine doesn't even bother to go through matching process.)
  2. Studying pattern to ensure that minimum length of subject is not shorter than pattern itself
  3. Auto-possessification
  4. Fast failure (if a specific point is not found engine doesn't even bother to go through matching process.)

Superficial pattern analyzing:

<option             # Subject pre-scan applied (unachored pattern)
    [\\s]
    value
    [\\s]*          # Auto-possessification applied (translates to \s*+)
    =
    [\\s]*          # //
    \"([^\">]*)\"   
    [\\s]*          # //
    [^>]*
>                   # Min length (17 chars) check of subject string applied

Furthermore, if input string doesn't have a special character like >, a fast failure is supposed to be thrown. You should know that performance can depend on input string heavily as well.

Run below pattern:

(*NO_AUTO_POSSESS)(*NO_START_OPT)<option[\s]value[\s]*=[\s]*\"([^\">]*)\"[\s]*[^>]*>

over this input string (watch that period):

<option value                                                                 .

and compare result (Live demo).



来源:https://stackoverflow.com/questions/41481811/why-pcre-regex-is-much-faster-than-c11-regex

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!