Why does a simple .*? non-greedy regex greedily include additional characters before a match?

我怕爱的太早我们不能终老 提交于 2020-01-17 12:56:06

问题


I have a very simple regex similar to this:

HOHO.*?_HO_

With this test string...

fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev

  • I expect it to match just _HOHO___HO_ (shortest match, non-greedy)
  • Instead it matches _HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_ (longest match, looks greedy).

Why? How can I make it match the shortest match?

Adding and removing the ? gives the same result.

Edit - better test string that shows why [^HOHO] doesn't work: fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye


All I can think of is that maybe it is matching multiple times - but there's only one match for _HO_, so I don't understand why it isn't taking the shortest match that ends at the _HO_, discarding the rest.

I've browsed all the questions I can find with titles like "Non-greedy regex acts greedy", but they all seem to have some other problem.


回答1:


I figured out a solution with some help from Regex lazy vs greedy confusion.

In regex engines like the one used by Javascript (NFA engines I believe), non-greedy only gives you the match that is shortest going left to right - from the first left-hand match that fits to the nearest right-hand match.

Where there are many left-hand matches for one right-hand match, it will always go from the first it reaches (which will actually give the longest match).

Essentially, it goes through the string one character at a time asking "Are there matches from this character? If so, match the shortest and finish. If no, move to next character, repeat". I expected it to be "Are there matches anywhere in this string? If so, match the shortest of all of them".


You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning "not the left-side match". To negate a string like this requires negative lookaheads and non-capturing groups, but it's as simple as dropping the string into (?:(?!).). For example, (?:(?!HOHO).)

For example, the equivalent of HOHO.*?_HO_ which is non-greedy on the left and right would be:

HOHO(?:(?!HOHO).)*?_HO_

So the regex engine is essentially going through each character like this:

  • HOHO - Does this match the left side?
  • (?:(?!HOHO).)* - If so, can I reach the right-hand side without any repeats of the left side?
  • _HO_ - If so, grab everything until the right-hand match
  • ? modifier on * or + - If there are multiple right-hand matches, choose the nearest one



回答2:


Why it matches the whole string?

This is because regular-expression pattern matching is done by finding the first position in the string at which a match is possible. Since a match is possible starting at the first character of the string, shorter matches starting at subsequent characters are never even considered.

Example:
Let's consider a regular expression /a+?b/ and test string "aaaaaaaaab". When applied to the string it matches the whole string. Not just last a & b. This is because the first position in the string where a match is possible is at the first a.

So, if you want to match ab in aaaaaaaaab, use a negated character class based regex rather than a lazy dot:

a[^ab]*b

See the regex demo.

Source: Javascript: The Definitive Guide, Sixth Edition, Page Number: 255




回答3:


The result is non-greedy, because it's the shortest match from the first occurrence of HOHO until _HO_ is reached; the engine traverses the string from left to right and because it doesn't have to backtrack, it won't attempt to shorten anything.

To make it work in the way that's expected here, you need to have a greedy prefix in your expression:

/.*(HOHO.*?_HO_)/

The first memory capture contains the string that you're after; the greedy prefix will try to skip as many characters as possible, so it will match the last occurrence of HOHO first.



来源:https://stackoverflow.com/questions/59645041/regular-expression-with-non-greedy-in-python-gets-unexpected-result

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!