Regex not being greedy enough

早过忘川 提交于 2019-12-10 18:59:59

问题


I've got the following regex that was working perfectly until a new situation arose

^.*[?&]U(?:RL)?=(?<URL>.*)$

Basically, it's used against URLs, to grab EVERYTHING after the U=, or URL= and return it in the URL match

So, for the following

http://localhost?a=b&u=http://otherhost?foo=bar

URL = http://otherhost?foo=bar

Unfortunately an odd case came up

http://localhost?a=b&u=http://otherhost?foo=bar&url=http://someotherhost

Ideally, I want URL to be "http://otherhost?foo=bar&url=http://someotherhost", instead, it is just "http://someotherhost"

EDIT: I think this fixed it...though it's not pretty

^.*[?&](?<![?&]U(?:RL)?=.*)U(?:RL)?=(?<URL>.*)$

回答1:


The issue

The problem is not that .* is not being greedy enough; it's that the other .* that appears earlier is also greedy.

To illustrate the issue, let's consider a different example. Consider the following two patterns; they're identical, except in reluctance of \1 in second pattern:

              \1 greedy, \2 greedy         \1 reluctant, \2 greedy
              ^([0-5]*)([5-9]*)$           ^([0-5]*?)([5-9]*)$

Here we have two capturing groups. \1 captures [0-5]*, and \2 captures [5-9]*. Here's a side-by-side comparison of what these patterns match and capture:

              \1 greedy, \2 greedy          \1 reluctant, \2 greedy
              ^([0-5]*)([5-9]*)$            ^([0-5]*?)([5-9]*)$
Input         Group 1    Group 2            Group 1    Group 2
54321098765   543210     98765              543210     98765
007           00         7                  00         7
0123456789    012345     6789               01234      56789
0506          050        6                  050        6
555           555        <empty>            <empty>    555
5550555       5550555    <empty>            5550       555

Note that as greedy as \2 is, it can only grab what \1 didn't already grab first! Thus, if you want to make \2 grab as many 5 as possible, you have to make \1 reluctant, so the 5 is actually up for grab by \2.

Attachments

  • The first pattern on rubular.com
  • The second pattern on rubular.com

Related questions

  • Regular expression: who's greedier

The fix

So applying this to your problem, there are two ways that you can fix this: you can make the first .* reluctant, so (see on rubular.com):

^.*?[?&]U(?:RL)?=(?<URL>.*)$

Alternatively you can just get rid of the prefix matching part altogether (see on rubular.com):

[?&]U(?:RL)?=(?<URL>.*)$


来源:https://stackoverflow.com/questions/3045946/regex-not-being-greedy-enough

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!