Non-greedy string regular expression matching

独厮守ぢ 2020-11-28 10:21

I'm pretty sure I'm missing something obvious here, but I cannot get R to use non-greedy regular expressions:

> library(stringr)
> str_match('xxx a


        
2 Answers
  •  时光说笑
    2020-11-28 11:00

    The problem is matching the shortest window between two strings. @flodel correctly mentions that a regex engine parses the string from left to right, so every match is the leftmost one. Greediness and laziness only affect the right-hand boundary: a greedy quantifier extends the match to the rightmost possible boundary, while a lazy one stops at the first occurrence of the subpattern that follows.

    See the examples:

    > library(stringr)
    > str_extract('xxx aaaab yyy', "a[^ab]*b")
    [1] "ab"
    > str_extract('xxx aaa xxx aaa zzz', "xxx.*?zzz")
    [1] "xxx aaa xxx aaa zzz"
    > str_extract('xxx aaa xxx aaa zzz', "xxx(?:(?!xxx|zzz).)*zzz")
    [1] "xxx aaa zzz"
    

    The first and third patterns return the shortest window; the second illustrates the current problem, but with multicharacter boundaries.
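
    As a minimal sketch of why laziness only moves the right-hand boundary, the snippet below compares a greedy and a lazy quantifier in base R. The input strings and variable names are made up for illustration.

    ```r
    # Hypothetical input: one left boundary, two candidate right boundaries.
    x <- "xxx aaa zzz bbb zzz"

    # Greedy: runs to the LAST zzz.
    greedy <- regmatches(x, regexpr("xxx.*zzz", x, perl = TRUE))
    # Lazy: stops at the FIRST zzz.
    lazy <- regmatches(x, regexpr("xxx.*?zzz", x, perl = TRUE))

    print(greedy)  # "xxx aaa zzz bbb zzz"
    print(lazy)    # "xxx aaa zzz"

    # Two left boundaries: the match still starts at the leftmost xxx,
    # so the lazy quantifier does not give the shortest window here.
    y <- "xxx aaa xxx aaa zzz"
    lazy_left <- regmatches(y, regexpr("xxx.*?zzz", y, perl = TRUE))
    print(lazy_left)  # "xxx aaa xxx aaa zzz"
    ```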

    Scenario 1. Boundaries are single characters

    When a and b are single characters, a negated character class finds the shortest window: a[^ab]*b matches from an a to the next b with no a or b in between.
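
    A short sketch of Scenario 1 on a longer, made-up input, using base R's gregexpr to collect every match. The lazy pattern a.*?b is included only for contrast.

    ```r
    # Hypothetical input with several a...b windows.
    x <- "xxx aaaab yyy acccb aab"

    # Negated character class: each match starts at the a closest to its b.
    shortest <- unlist(regmatches(x, gregexpr("a[^ab]*b", x, perl = TRUE)))
    # Lazy dot: each match starts at the leftmost available a.
    lazy <- unlist(regmatches(x, gregexpr("a.*?b", x, perl = TRUE)))

    print(shortest)  # "ab"    "acccb" "ab"
    print(lazy)      # "aaaab" "acccb" "aab"
    ```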

    Scenario 2. Boundaries are not single characters

    In these cases you may use a tempered greedy token, which can be further unrolled. The xxx(?:(?!xxx|zzz).)*zzz pattern matches xxx, then zero or more characters other than line break characters, each of which does not start an xxx or zzz sequence (the (?!xxx|zzz) is a negative lookahead that fails the match if the substring immediately to the right matches the lookahead pattern), and then zzz.
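
    The answer does not spell out the unrolled form, so the following is only one possible sketch of it. It assumes the usual "normal* (special normal*)*" unrolling, where [^xz\n] covers characters that cannot start either boundary and the lookahead is checked only before an x or a z. The variable names and the second test string are made up.

    ```r
    tempered <- "xxx(?:(?!xxx|zzz).)*zzz"
    # Assumed unrolled equivalent: the lookahead runs only before x or z.
    unrolled <- "xxx[^xz\\n]*(?:(?!xxx|zzz)[xz][^xz\\n]*)*zzz"

    # Return every match of a PCRE pattern in a string.
    all_matches <- function(pattern, s) {
      unlist(regmatches(s, gregexpr(pattern, s, perl = TRUE)))
    }

    x <- "xxx aaa xxx aaa zzz xxx bbb xxx ccc zzz"
    print(all_matches(tempered, x))  # "xxx aaa zzz" "xxx ccc zzz"
    print(all_matches(unrolled, x))  # "xxx aaa zzz" "xxx ccc zzz"

    # A lone x or z inside the window is still allowed by both patterns.
    y <- "xxx a x z b zzz"
    print(all_matches(unrolled, y))  # "xxx a x z b zzz"
    ```

    The unrolled version usually needs fewer lookahead checks, at the cost of a less readable pattern.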

    Both approaches also work with base R regmatches (pass perl = TRUE to use the PCRE flavor, which supports lookaheads):

    > x <- 'xxx aaa xxx aaa zzz xxx bbb xxx ccc zzz'
    > unlist(regmatches(x, gregexpr("xxx(?:(?!xxx|zzz).)*zzz", x, perl = TRUE)))
    [1] "xxx aaa zzz" "xxx ccc zzz"
    

    One note: with a PCRE regex in base R, or the ICU regex in str_extract/str_match, the . does not match line break characters. To enable that behavior, add (?s), an inline DOTALL modifier, at the start of the pattern.
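
    A small sketch of the (?s) note in base R with perl = TRUE. The input string, which contains a newline between the boundaries, is made up for illustration.

    ```r
    # Hypothetical input: the window spans a line break.
    x <- "xxx aaa\nbbb zzz"

    # Without (?s) the dot stops at the newline, so nothing matches.
    no_dotall <- regmatches(x, regexpr("xxx(?:(?!xxx|zzz).)*zzz", x, perl = TRUE))
    # With (?s) the dot also matches the newline.
    dotall <- regmatches(x, regexpr("(?s)xxx(?:(?!xxx|zzz).)*zzz", x, perl = TRUE))

    print(length(no_dotall))  # 0
    print(dotall)             # "xxx aaa\nbbb zzz"
    ```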
