Question
I've got the following regex, which was working perfectly until a new situation arose:
^.*[?&]U(?:RL)?=(?<URL>.*)$
Basically, it's used against URLs to grab EVERYTHING after the U= or URL= and return it in the URL group.
So, for the following
http://localhost?a=b&u=http://otherhost?foo=bar
URL = http://otherhost?foo=bar
Unfortunately an odd case came up
http://localhost?a=b&u=http://otherhost?foo=bar&url=http://someotherhost
Ideally, I want URL to be "http://otherhost?foo=bar&url=http://someotherhost"; instead, it is just "http://someotherhost".
EDIT: I think this fixed it...though it's not pretty
^.*[?&](?<![?&]U(?:RL)?=.*)U(?:RL)?=(?<URL>.*)$
Answer 1:
The issue
The problem is not that .* is not being greedy enough; it's that the other .* that appears earlier is also greedy.
To illustrate the issue, let's consider a different example. Consider the following two patterns; they're identical, except that \1 is reluctant in the second pattern:
\1 greedy, \2 greedy      \1 reluctant, \2 greedy
^([0-5]*)([5-9]*)$        ^([0-5]*?)([5-9]*)$
Here we have two capturing groups. \1 captures [0-5]*, and \2 captures [5-9]*. Here's a side-by-side comparison of what these patterns match and capture:
               \1 greedy, \2 greedy      \1 reluctant, \2 greedy
               ^([0-5]*)([5-9]*)$        ^([0-5]*?)([5-9]*)$
Input          Group 1     Group 2       Group 1     Group 2
54321098765    543210      98765         543210      98765
007            00          7             00          7
0123456789     012345      6789          01234       56789
0506           050         6             050         6
555            555         <empty>       <empty>     555
5550555        5550555     <empty>       5550        555
Note that as greedy as \2 is, it can only grab what \1 didn't already grab first! Thus, if you want \2 to grab as many 5s as possible, you have to make \1 reluctant, so that the 5s are actually up for grabs by \2.
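For anyone who wants to reproduce the comparison locally, here is a minimal sketch in Python. The helper name split_digits and the constant names are made up for illustration; the two patterns are the ones from the table above.

```python
import re

# The two patterns from the table; only the quantifier on group 1 differs.
GREEDY = re.compile(r"^([0-5]*)([5-9]*)$")
RELUCTANT = re.compile(r"^([0-5]*?)([5-9]*)$")


def split_digits(pattern, text):
    """Return (group 1, group 2) if the pattern matches, otherwise None.

    Hypothetical helper, named here only for this illustration.
    """
    m = pattern.match(text)
    return m.groups() if m else None


if __name__ == "__main__":
    for text in ["54321098765", "007", "0123456789", "0506", "555", "5550555"]:
        print(text, split_digits(GREEDY, text), split_digits(RELUCTANT, text))
```

The interesting rows are the ones containing 5, the only digit both character classes accept: with a greedy \1 the 5s end up in group 1, and with a reluctant \1 they are left for group 2 wherever the overall match still succeeds.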
Attachments
- The first pattern on rubular.com
- The second pattern on rubular.com
Related questions
- Regular expression: who's greedier
The fix
So, applying this to your problem, there are two ways you can fix this. You can make the first .* reluctant (see on rubular.com):
^.*?[?&]U(?:RL)?=(?<URL>.*)$
Alternatively, you can just get rid of the prefix-matching part altogether. An unanchored search returns the leftmost match, so it starts at the first U= or URL= (see on rubular.com):
[?&]U(?:RL)?=(?<URL>.*)$
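As a rough check, both fixes can be tried against the URL from the question. This sketch uses Python, which is an assumption on my part: Python spells the named group (?P<URL>...) rather than (?<URL>...), and since the sample URLs use lowercase u= and url=, it also assumes the pattern is meant to be case-insensitive. The helper name extract_url is hypothetical.

```python
import re

# Python needs (?P<URL>...) for the named group; IGNORECASE is an assumption,
# made because the sample URLs use lowercase "u=" and "url=".
ORIGINAL = re.compile(r"^.*[?&]U(?:RL)?=(?P<URL>.*)$", re.IGNORECASE)
RELUCTANT_PREFIX = re.compile(r"^.*?[?&]U(?:RL)?=(?P<URL>.*)$", re.IGNORECASE)
NO_PREFIX = re.compile(r"[?&]U(?:RL)?=(?P<URL>.*)$", re.IGNORECASE)


def extract_url(pattern, text):
    """Return the URL group, or None when the pattern does not match.

    Hypothetical helper, named here only for this illustration.
    """
    m = pattern.search(text)
    return m.group("URL") if m else None


sample = "http://localhost?a=b&u=http://otherhost?foo=bar&url=http://someotherhost"

if __name__ == "__main__":
    print(extract_url(ORIGINAL, sample))          # http://someotherhost
    print(extract_url(RELUCTANT_PREFIX, sample))  # http://otherhost?foo=bar&url=http://someotherhost
    print(extract_url(NO_PREFIX, sample))         # http://otherhost?foo=bar&url=http://someotherhost
```

Note that the second fix relies on a search-style call (one that scans for the leftmost match); an API that requires the pattern to match the whole input would need the reluctant-prefix version instead.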
Source: https://stackoverflow.com/questions/3045946/regex-not-being-greedy-enough