regular expression greedy on left side only (.net)

回眸只為那壹抹淺笑 submitted on 2019-12-23 04:34:07

Question


I am trying to capture matches between two strings.

For example, I am looking for all text that appears between Q and XYZ, using the "soonest" match (not continuing to expand outwards). This string:

circus Q hello there Q SOMETEXT XYZ today is the day XYZ okay XYZ

Should return:

Q SOMETEXT XYZ

But instead, it returns:

Q hello there Q SOMETEXT XYZ

Here is the expression I'm using: Q.*?XYZ

It's going too far back to the left. It's working fine on the right side when I use the question mark after the asterisk. How can I do the same for the left side, and stop once I hit that first left Q, making it work the same as the right side works? I've tried question marks and other symbols from http://msdn.microsoft.com/en-us/library/az24scfc.aspx, but there's something I'm just not figuring out.

I'm a regex novice, so any help on this would be appreciated!


Answer 1:


Well, the non-greedy match is working: it gets the shortest string that satisfies the regex from the point where the match starts. The thing you have to remember is that a regex is evaluated left to right. So it matches the first Q, then takes the fewest characters it can before an XYZ. If you want it not to go past any Qs, you have to use a negated character class:

Q[^Q]*?XYZ
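To make the difference concrete, here is a minimal sketch that runs both patterns over the sample string from the question. It uses Python's `re` module rather than .NET, on the assumption that lazy quantifiers and negated classes behave the same way in both engines; the variable names are only illustrative.

```python
import re

# Sample text from the question.
text = "circus Q hello there Q SOMETEXT XYZ today is the day XYZ okay XYZ"

# Lazy dot: the match starts at the first Q and expands until the first XYZ.
lazy_dot = re.search(r"Q.*?XYZ", text).group(0)

# Negated class: an attempt starting at the first Q fails when it reaches
# the second Q, so the engine retries from the second Q.
negated = re.search(r"Q[^Q]*?XYZ", text).group(0)

print(lazy_dot)  # Q hello there Q SOMETEXT XYZ
print(negated)   # Q SOMETEXT XYZ
```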

[^Q] matches any one character that is not a Q. Note that this only works when the opening delimiter is a single character. If your opening delimiter is multiple characters, you have to do it a different way. Why? Well, take the delimiter 'PQR' and the string

foo PQR bar XYZ 

If you take the regex from before and simply extend the character class to

PQR[^PQR]*?XYZ

then you'll get

'PQR bar XYZ'

as expected. But if your string is

foo PQR Party Time! XYZ 

You'll get no matches. That's because [] delimits a "character class", which matches exactly one character. Inside a class you can match any one of a set of characters, simply by listing them.

th[ae]n

will match both 'than' and 'then', but not 'thin'. Placing a caret ('^') at the beginning negates the class, meaning "match anything but these characters". So by turning our one-character class into [^PQR], rather than saying "not 'PQR'", you're saying "not 'P', 'Q', or 'R'". You can still use this if you want, but only if you're 100% sure that the characters from your delimiter will only appear in your delimiter. If that's the case, it's faster to negate only the first character of your delimiter and use greedy matching. (Greedy matching runs on to the last XYZ that has no P before it, so keep the *? if more than one XYZ can follow.) The regex for that would be:

PQR[^P]*XYZ 
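The character-class behaviour described above can be checked with a short sketch. This is Python's `re`, assumed here to treat these classes the same way .NET does. The last test string, with two XYZs, is a made-up input added to show where the greedy form ends.

```python
import re

# th[ae]n: the class [ae] stands for exactly one character, 'a' or 'e'.
words = [w for w in ["than", "then", "thin"] if re.fullmatch(r"th[ae]n", w)]

# [^PQR] excludes the single characters P, Q and R, not the string "PQR".
multi = re.compile(r"PQR[^PQR]*?XYZ")
ok = multi.search("foo PQR bar XYZ")
blocked = multi.search("foo PQR Party Time! XYZ")  # the 'P' in "Party" blocks it

# Negating only the first delimiter character, with a greedy quantifier.
# Hypothetical input: greedy matching extends to the last XYZ.
first_only = re.compile(r"PQR[^P]*XYZ")
greedy = first_only.search("foo PQR bar XYZ baz XYZ")

print(words)            # ['than', 'then']
print(ok.group(0))      # PQR bar XYZ
print(blocked)          # None
print(greedy.group(0))  # PQR bar XYZ baz XYZ
```

If the match should stop at the first XYZ in that last case, the lazy form `PQR[^P]*?XYZ` would be the one to use.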

But, if you can't make that guarantee, then match with:

PQR(?:(?!PQR).)*?XYZ

Regex has no direct syntax for "any text that does not contain this string", so you have to use a negative lookahead.

(?!PQR)

is just such a lookahead. It means "assert that the next few characters do not match this inner regex", without consuming any characters, so

(?!PQR).

matches any one character that is not the start of a PQR. (The lookahead goes before the dot. With .(?!PQR) the character right after the opening delimiter is consumed before anything is checked, so a second PQR immediately following the first would slip through.) Wrap that in a group so that you can lazily repeat it,

((?!PQR).)*?

and you have a match for "string that doesn't contain my delimiter". The only other thing I did was add a ?: to make it a non-capturing group.

(?:(?!PQR).)*?

Depending on the language you use to parse your regex, it may try to pass back every matched group individually (useful for find and replace). This keeps it from doing that.
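As a rough illustration of the lookahead approach, the sketch below uses Python's `re`, assuming lookaheads and non-capturing groups work as they do in .NET. It writes the lookahead before the dot, `(?:(?!PQR).)*?`, so each character is checked before it is consumed. The input string is made up for the example.

```python
import re

# Hypothetical input with a stray 'P' and a second opening delimiter.
text = "foo PQR Party PQR Time! XYZ"

# Tempered dot: consume a character only if "PQR" does not start there.
tempered = re.compile(r"PQR(?:(?!PQR).)*?XYZ")
closest = tempered.search(text).group(0)

# Same pattern with a capturing group: findall now reports what the group
# captured on its last repetition instead of the whole match.
capturing = re.compile(r"PQR((?!PQR).)*?XYZ")
groups_only = capturing.findall(text)
whole = tempered.findall(text)

print(closest)      # PQR Time! XYZ
print(groups_only)  # [' ']
print(whole)        # ['PQR Time! XYZ']
```

The second half shows why the `?:` matters in practice: with a capturing group, APIs that hand back groups return the group's contents rather than the full match.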

Happy regexing!




Answer 2:


Greediness only affects the right side of a match: it controls where the match ends, not where it starts.

To make the expression only match from the last Q before XYZ, make it not match Q between them:

Q[^Q]*?XYZ


Source: https://stackoverflow.com/questions/12186389/regular-expression-greedy-on-left-side-only-net
