Odd Behavior with Greedy Modifiers Inside Capture Groups

筅森魡賤 submitted on 2020-02-03 04:23:32

Question


Consider the following commands:

text <- "abcdEEEEfg"

sub("c.+?E", "###", text)
# [1] "ab###EEEfg"                          <<< OKAY
sub("c(.+?)E", "###", text)
# [1] "ab###EEfg"                           <<< WEIRD
sub("c(.+?)E", "###", text, perl = TRUE)
# [1] "ab###EEEfg"                          <<< OKAY

The first does exactly what I expect: it matches only up to the first E. The second should be essentially identical to the first, since all I'm doing is adding a capturing group (which I'm not even using), yet for some reason it consumes an extra E. That said, it isn't fully greedy either; if it were, it would have consumed all the Es. Even weirder, it still matches the pattern, even though the sub result suggests the .+? piece left out EE, which can no longer be matched by the rest of the regular expression. This suggests an offset issue when computing the length of the matched sub-expression, rather than a problem in the matching itself.

The final one is exactly the same but run with PCRE, and that works as expected.
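One way to probe the offset hypothesis is to look at the match position and length directly rather than at the output of sub. The following is a minimal sketch using regexpr and regmatches; the variable names m_tre and m_pcre are only illustrative. No expected output is given for the default engine, since that is the behavior in question.

```r
text <- "abcdEEEEfg"

# First match under each engine: start position plus a match.length attribute.
m_tre  <- regexpr("c(.+?)E", text)               # default engine (TRE)
m_pcre <- regexpr("c(.+?)E", text, perl = TRUE)  # PCRE

# PCRE: the lazy quantifier stops at the first E, so the match is "cdE".
as.integer(m_pcre)             # start of the match
# [1] 3
attr(m_pcre, "match.length")   # number of characters matched
# [1] 3
regmatches(text, m_pcre)       # the matched substring
# [1] "cdE"

# Default engine: compare these values with the PCRE ones above.
as.integer(m_tre)
attr(m_tre, "match.length")
regmatches(text, m_tre)
```

If the start positions agree but the match lengths differ, that would be consistent with the length being miscomputed rather than the match being found in the wrong place.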

Am I missing something or is this behavior undocumented/buggy?


Answer 1:


R's default regular expression engine is libtre, version 0.8. For more predictable results, you should always use perl = TRUE.

Note that

sub("c(.+?)E?", "###", text)

works.
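As a small sketch of the advice above, assuming the goal is to replace everything from c through the first E: the first call is the question's pattern run under PCRE. The second call uses a negated character class, which is an alternative not mentioned in the answer; it expresses "up to the first E" without relying on a lazy quantifier at all. The names res_pcre and res_class are illustrative.

```r
text <- "abcdEEEEfg"

# Lazy quantifier inside a capture group, evaluated by PCRE.
res_pcre <- sub("c(.+?)E", "###", text, perl = TRUE)
res_pcre
# [1] "ab###EEEfg"

# Assumption (not from the answer): [^E]+ cannot run past the first E,
# so no lazy quantifier is needed inside the group.
res_class <- sub("c([^E]+)E", "###", text, perl = TRUE)
res_class
# [1] "ab###EEEfg"
```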



Source: https://stackoverflow.com/questions/22056044/odd-behavior-with-greedy-modifiers-inside-capture-groups
