问题
I have the following line:
"14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)"
I parse this by using a simple regexp:
if($line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/) {
my($ts, $hash, $pid, $handle, $quote) = ($1, $2, $3, $4, $5);
}
But the ; at the end messes things up and I don't know why. Shouldn't the greedy operator handle "everything"?
回答1:
The greedy operator tries to grab as much stuff as it can and still match the string. What's happening is the first one (after "say") grabs "0ed673079715c343281355c2a1fde843;2", the second one takes "laka", the third finds "hello " and the fourth matches the parenthesis.
What you need to do is make all but the last one non-greedy, so they grab as little as possible and still match the string:
(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)
回答2:
(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)
should work better
回答3:
Although a regex can easily do this, I'm not sure it's the most straight-forward approach. It's probably the shortest, but that doesn't actually make it the most maintainable.
Instead, I'd suggest something like this:
$x="14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";
if (($ts,$rest) = $x =~ /(\d+:\d+)\s+(.*)/)
{
my($command,$hash,$pid,$handle,$quote) = split /;/, $rest, 5;
print join ",", map { "[$_]" } $ts,$command,$hash,$pid,$handle,$quote
}
This results in:
[14:48],[say],[0ed673079715c343281355c2a1fde843],[2],[laka],[hello ;)]
I think this is just a bit more readable. Not only that, I think it's also easier to debug and maintain, because this is closer to how you would do it if a human were to attempt the same thing with pen and paper. Break the string down into chunks that you can then parse easier - have the computer do exactly what you would do. When it comes time to make modifications, I think this one will fare better. YMMV.
回答4:
Try making the first 3 (.*)
ungreedy (.*?)
回答5:
If the values in your semicolon-delimited list cannot include any semicolons themselves, you'll get the most efficient and straightforward regular expression simply by spelling that out. If certain values can only be, say, a string of hex characters, spell that out. Solutions using a lazy or greedy dot will always lead to a lot of useless backtracking when the regex does not match the subject string.
(\d+:\d+)\ssay;([a-f0-9]+);(\d+);(\w+);([^;\r\n]+)
回答6:
You could make * non-greedy by appending a question mark:
$line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/
or you can match everything except a semicolon in each part except the last:
$line =~ /(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)/
来源:https://stackoverflow.com/questions/255815/how-can-i-fix-my-regex-to-not-match-too-much-with-a-greedy-quantifier