How can I fix my regex to not match too much with a greedy quantifier? [duplicate]

问题

I have the following line:

"14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)"

I parse this by using a simple regexp:

if($line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/) {
    my($ts, $hash, $pid, $handle, $quote) = ($1, $2, $3, $4, $5);
}

But the ; at the end messes things up and I don't know why. Shouldn't the greedy operator handle "everything"?

回答1:

The greedy operator tries to grab as much stuff as it can and still match the string. What's happening is the first one (after "say") grabs "0ed673079715c343281355c2a1fde843;2", the second one takes "laka", the third finds "hello " and the fourth matches the parenthesis.

What you need to do is make all but the last one non-greedy, so they grab as little as possible and still match the string:

(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)

回答2:

(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)

should work better

回答3:

Although a regex can easily do this, I'm not sure it's the most straight-forward approach. It's probably the shortest, but that doesn't actually make it the most maintainable.

Instead, I'd suggest something like this:

$x="14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

if (($ts,$rest) = $x =~ /(\d+:\d+)\s+(.*)/)
{
    my($command,$hash,$pid,$handle,$quote) = split /;/, $rest, 5;
    print join ",", map { "[$_]" } $ts,$command,$hash,$pid,$handle,$quote
}

This results in:

[14:48],[say],[0ed673079715c343281355c2a1fde843],[2],[laka],[hello ;)]

I think this is just a bit more readable. Not only that, I think it's also easier to debug and maintain, because this is closer to how you would do it if a human were to attempt the same thing with pen and paper. Break the string down into chunks that you can then parse easier - have the computer do exactly what you would do. When it comes time to make modifications, I think this one will fare better. YMMV.

回答4:

Try making the first 3 (.*) ungreedy (.*?)

回答5:

If the values in your semicolon-delimited list cannot include any semicolons themselves, you'll get the most efficient and straightforward regular expression simply by spelling that out. If certain values can only be, say, a string of hex characters, spell that out. Solutions using a lazy or greedy dot will always lead to a lot of useless backtracking when the regex does not match the subject string.

(\d+:\d+)\ssay;([a-f0-9]+);(\d+);(\w+);([^;\r\n]+)

回答6:

You could make * non-greedy by appending a question mark:

$line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/

or you can match everything except a semicolon in each part except the last:

$line =~ /(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)/

来源：https://stackoverflow.com/questions/255815/how-can-i-fix-my-regex-to-not-match-too-much-with-a-greedy-quantifier

标签

regex

perl

parsing

greedy

regex-greedy