Regular expression greedy match not working as expected

柔情痞子 提交于 2019-12-05 16:22:15

The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.

Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:

^(?>%?)\S{3}

If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.

Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.

Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".

Instead, what you probably want is something like this:

^(%\S{3}|[^%\s]\S{2})

Which will match either a % followed by 3 characters, or a non-%, non-whitespace followed by 2 more.

I always love to look at RE questions to see how much time people spend on them to "Save time"

str.len() >= str[0]=='&' ? 4 : 3

Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)

Try the regex modified a little based on Dav's original one:

^(%\S{3,}|[^%\s]\S{2,})

with the regex option "^ and $ match at line breaks" on.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!