Regex lookahead

Posted by 喜欢而已 on 2019-12-05 10:31:00

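You could also try a greedy version that checks, before consuming each character, that the next test:? does not start there.
In expanded (whitespace-insensitive) form: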

(test:\? (?: (?!test:\?)[\s\S])* )
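As a rough illustration, the expanded pattern above can be tried in Python with re.VERBOSE, which ignores the extra whitespace. The sample string is borrowed from the Perl answer further down; the names text, tempered and chunks are made up for this sketch.

```python
import re

text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"

# Greedy repetition, but a character is consumed only when the
# delimiter does not start at that position (negative lookahead).
tempered = re.compile(r"""
    ( test:\?                      # the literal delimiter
      (?: (?!test:\?) [\s\S] )*    # any character that does not begin the next delimiter
    )
""", re.VERBOSE)

chunks = tempered.findall(text)
for chunk in chunks:
    print(chunk)
```

Because the lookahead is checked before every character, the greedy * stops by itself at the next test:? or at the end of the string, so no $ alternative is needed here.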

The Perl program below

#! /usr/bin/env perl

use strict;
use warnings;

$_ = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2";

while (/(test:\?  .*?) (?= test:\? | $)/gx) {
  print "[$1]\n";
}

produces the desired output from your question, plus brackets for emphasis.

[test:?foo2=bar2&baz2=foo2]
[test:?foo=bar&baz=foo]
[test:?foo2=bar2&baz2=foo2]

Remember that regex quantifiers are greedy by default and want to gobble up as much as they can without breaking the match. Here you want each subsegment to terminate as soon as possible, which means non-greedy .*? semantics.
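To make the difference concrete, here is a small Python sketch (not part of the original Perl answer) that runs the same pattern with .* and with .*? on the sample string. The variable names are illustrative only.

```python
import re

text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"

# Greedy: .* runs to the end of the string, where the $ branch of the
# lookahead already succeeds, so the whole input comes back as one match.
greedy = re.findall(r"(test:\?.*)(?=test:\?|$)", text)

# Non-greedy: .*? stops at the first position where the lookahead succeeds.
lazy = re.findall(r"(test:\?.*?)(?=test:\?|$)", text)

print(len(greedy), len(lazy))
```

The greedy variant should yield a single match covering the whole string, while the non-greedy one yields the three subsegments.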

Each subsegment terminates with either another test:? or end-of-string, which we look for with (?=...) zero-width lookahead wrapped around | for alternatives.

The pattern in the code above uses Perl’s /x regex switch for readability. Depending on the language and libraries you’re using, you may need to remove the extra whitespace.
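In Python, for instance, the closest counterpart to /x is the re.VERBOSE flag. The sketch below is a port I am assuming is faithful to the Perl pattern; it compares the spaced-out pattern with and without the flag, and against a compact version with the whitespace removed.

```python
import re

text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"

spaced = r"(test:\?  .*?) (?= test:\? | $)"

# With re.VERBOSE the spaces are ignored, as with Perl's /x.
verbose = re.compile(spaced, re.VERBOSE)

# Same pattern with the whitespace stripped by hand.
compact = re.compile(r"(test:\?.*?)(?=test:\?|$)")

# Without the flag the spaces are literal characters, and the sample
# string contains none, so nothing matches.
spaced_without_flag = re.findall(spaced, text)

for chunk in verbose.findall(text):
    print(f"[{chunk}]")
```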

Three issues:

  • (?!) is a negative lookahead assertion. You want (?=) instead, requiring that what comes next is test:?.

  • The .* is greedy; you want it non-greedy so that you grab just the first chunk.

  • You're wanting the last chunk also, so you want to match $ as well at the end.

End result:

(?:test:\?)(.*?)(?=test:\?|$)

I've also removed the outer group, seeing no point in it. All RE engines that I know of let you access group 0 as the full match, or some other such way (though perhaps not when finding all matches). You can put it back if you need to.

(This works in PCRE; not sure if it would work with POSIX regular expressions, as I'm not in the habit of working with them.)

If you're just wanting to split on test:?, though, regular expressions are the wrong tool. Split the strings using your language's inbuilt support for such things.
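As one possible sketch of the non-regex route, Python's built-in str.split does the job. The names delimiter, parts and chunks are illustrative, and dropping empty pieces is an assumption about what you want.

```python
text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"
delimiter = "test:?"

# split() returns an empty first element because the string starts with
# the delimiter; filter empty pieces out.
parts = [piece for piece in text.split(delimiter) if piece]

# Put the delimiter back if each chunk should keep its "test:?" prefix.
chunks = [delimiter + piece for piece in parts]

print(parts)
print(chunks)
```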

Python:

>>> import re
>>> re.findall(r'(?:test:\?)(.*?)(?=test:\?|$)',
... 'test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2')
['foo2=bar2&baz2=foo2', 'foo=bar&baz=foo', 'foo2=bar2&baz2=foo2']

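You probably want ((?:test:\?)(.*?)(?=test:\?)), although you haven't told us what language you're using to drive the regexes. As written, the lookahead requires a following test:?, so the final chunk is not matched; use (?=test:\?|$) if you need that one too.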

The .*? matches as few characters as possible without preventing the overall pattern from matching, whereas .* matches as many as possible (it is greedy).

Depending, again, on what language you're using to do this, you'll probably need to match, then chop the string, then match again, or call some language-specific match_all type function.
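For example, in Python the two approaches might look like the sketch below: a manual loop that matches, advances past the match, and matches again, next to re.finditer as the "match all" style function. The pattern with |$ in the lookahead is an assumption, chosen so that the last chunk is included.

```python
import re

text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"
pattern = re.compile(r"test:\?(.*?)(?=test:\?|$)")

# Manual approach: search, record the match, resume just after it.
found_manually = []
pos = 0
while True:
    match = pattern.search(text, pos)
    if match is None:
        break
    found_manually.append(match.group(1))
    pos = match.end()  # every match consumes at least "test:?", so pos always advances

# Library approach: let the regex module iterate over all matches.
found_with_finditer = [m.group(1) for m in pattern.finditer(text)]

print(found_manually)
print(found_with_finditer)
```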

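By the way, you don't strictly need a lookahead to delimit the match (you can just match the terminating pattern instead), so this may do in your case if you only need the first chunk. Be aware that it consumes the second test:?, so repeated matching will skip chunks: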

test:[?](.*?)test:[?]
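A quick way to check how this consuming form behaves under repeated matching is to run it next to a lookahead variant. This is only a sketch on the sample string used elsewhere on this page; the variable names are made up.

```python
import re

text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"

# The trailing test:[?] is consumed, so the next search starts after it
# and the second chunk has lost its opening delimiter.
consuming = re.findall(r"test:[?](.*?)test:[?]", text)

# The lookahead only peeks at the next delimiter and leaves it in place.
peeking = re.findall(r"test:[?](.*?)(?=test:[?]|$)", text)

print(consuming)
print(peeking)
```

On this input the consuming form should return only the first chunk, while the lookahead form returns all three.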