Is regex too slow? Real life examples where simple non-regex alternative is better

后端 未结 5 1371
粉色の甜心
粉色の甜心 2020-12-05 13:18

I\'ve seen people here made comments like \"regex is too slow!\", or \"why would you do something so simple using regex!\" (and then present a 10+ lines alternative instead)

5条回答
  •  情书的邮戳
    2020-12-05 14:11

    I remember a textbook example of a regex gone bad. Be aware that none of the following approaches is recommended for production use! Use a proper CSV parser instead.

    The mistake made in this example is quite common: Using a dot where a narrower character class is better suited.

    In a CSV file containing on each line exactly 12 integers separated by commas, find the lines that have a 13 in the 6th position (no matter where else a 13 may be).

    1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12 // don't match
    42,12,13,12,32,13,14,43,56,31,78,10 // match
    42,12,13,12,32,14,13,43,56,31,78,10 // don't match
    

    We use a regex containing exactly 11 commas:

    ".*,.*,.*,.*,.*,13,.*,.*,.*,.*,.*,.*"
    

    This way, each ".*" is confined to a single number. This regex solves the task, but has very bad performance. (Roughly 600 microseconds per string on my computer, with little difference between matched and unmatched strings.)

    A simple non-regex solution would be to split() each line and compare the 6th element. (Much faster: 9 microseconds per string.)

    The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ".*" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

    So we replace the greedy quantifier with the reluctant one:

    ".*?,.*?,.*?,.*?,.*?,13,.*?,.*?,.*?,.*?,.*?,.*?"
    

    This performs way better for a matched string (by a factor of 100), but has almost unchanged performance for a non-matched string.

    A performant regex replaces the dot by the character class "[^,]":

    "[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*"
    

    (This needs 3.7 microseconds per string for the matched string and 2.4 for the unmatched strings on my computer.)

提交回复
热议问题