Question
I have a complex precompiled regular expression in Perl. In most cases the regex is fine: it matches everything it should and nothing it shouldn't. There is one exception.
Basically my regex looks like:
my $regexp = qr/\b(FOO|BAR|\*)\b/;
Unfortunately, m/\b\*\b/ won't match a string such as example, *. Only m/\*/ does, and I can't use that because of false positives. Is there a workaround?
From the comments: the false positives are **, example*, and exam*ple.
What is the regex intended for? It should extract keywords (one of which is a single asterisk) that coworkers have entered into product data. The goal is to move this information out of a free-text field into an atomic one.
Answer 1:
It sounds like you want to treat * as a word character.
\b
is equivalent to
(?x: (?<!\w)(?=\w) | (?<=\w)(?!\w) )
so you want
(?x: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
Applied, you get the following:
qr/
(?: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
(FOO|BAR|\*)
(?: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
/x
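The middle expression always starts and ends with a character from [\w*], so the (?=[\w*]) lookahead on the left and the (?<=[\w*]) lookbehind on the right are always true, and the alternatives that contradict them can never match. That simplifies the pattern to the following: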
qr/(?<![\w*])(FOO|BAR|\*)(?![\w*])/
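As a quick check, the simplified pattern can be run against the strings mentioned in the question. This is only a sketch: find_keyword is a hypothetical helper name introduced here, and the sample list is assembled from the question plus one made-up FOO line.

```perl
use strict;
use warnings;

# Simplified pattern from the answer: '*' is treated like a word character.
my $regexp = qr/(?<![\w*])(FOO|BAR|\*)(?![\w*])/;

# Hypothetical helper: returns the captured keyword, or undef if none is found.
sub find_keyword {
    my ($text) = @_;
    return $text =~ $regexp ? $1 : undef;
}

# One wanted match and the three false positives from the question,
# plus an assumed sample containing FOO.
for my $text ('example, *', '**', 'example*', 'exam*ple', 'a FOO b') {
    my $hit = find_keyword($text);
    printf "%-12s %s\n", $text, defined $hit ? "keyword: $hit" : 'no keyword';
}
```

Only the first and last strings should report a keyword; the three false positives are rejected because the asterisk has a [\w*] character next to it.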
Answer 2:
The problem is that Perl does not consider * to be a "word character", and thus does not recognize a word boundary between a space and an asterisk (whereas it does recognize one between the r and the * in foobar*).
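The behaviour of \b described above can be observed directly. The following is a small sketch; the test strings come from the question and from this answer, and the output layout is arbitrary.

```perl
use strict;
use warnings;

# \b matches only where a word character (\w) sits on exactly one side.
my @checks = (
    [ 'example, *', qr/\b\*\b/ ],  # space before '*', end of string after: no boundary
    [ 'foobar*',    qr/\b\*/   ],  # 'r' is a word character: boundary before '*'
    [ 'exam*ple',   qr/\b\*\b/ ],  # word characters on both sides: a false positive
);

for my $check (@checks) {
    my ($text, $re) = @$check;
    printf "%-12s %-16s %s\n", $text, $re, $text =~ $re ? 'match' : 'no match';
}
```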
The solution is to first decide what you do want to consider "word" and "non-word" characters, and then check for that explicitly. For example, if you want your words to consist only of letters 'A' to 'Z' (or their lowercase versions) and *, and for everything else to be treated as non-word characters, you can use:
/(?<![A-Za-z*])(FOO|BAR|\*)(?![A-Za-z*])/
This will match the strings FOO, BAR or *, provided that they're not preceded or followed by a character that matches [A-Za-z*].
Similarly, if you want to treat every non-whitespace character as a word character, you could use:
/(?<!\S)(FOO|BAR|\*)(?!\S)/
which will match FOO, BAR or *, provided that they're not preceded or followed by a non-whitespace character.
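The two variants disagree on inputs where the keyword touches punctuation or digits. Here is a rough comparison; the labels letters and whitespace, and the sample strings (*) and FOO2, are made up for illustration.

```perl
use strict;
use warnings;

# The two patterns from this answer, under hypothetical labels.
my %variant = (
    letters    => qr/(?<![A-Za-z*])(FOO|BAR|\*)(?![A-Za-z*])/,
    whitespace => qr/(?<!\S)(FOO|BAR|\*)(?!\S)/,
);

for my $text ('example, *', '(*)', 'FOO2', 'example*') {
    for my $name (sort keys %variant) {
        printf "%-12s %-10s %s\n", $text, $name,
            $text =~ $variant{$name} ? 'match' : 'no match';
    }
}
```

The letters variant accepts (*) and FOO2 because parentheses and digits are outside [A-Za-z*], while the whitespace variant rejects both. Which behaviour is right depends on how the keywords actually appear in the data.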
Answer 3:
How about:
my $regexp = qr/(?:\b(FOO|BAR)\b)|(?:^| )\*(?:$| )/;
In action:
use feature 'say';
my $re = qr~(?:\b(FOO|BAR)\b)|(?:^| )\*(?:$| )~;
while (<DATA>) {
    chomp;
    say(/$re/ ? "OK : $_" : "KO : $_");
}
__DATA__
FOO
BAR
*
exam*ple
example*
Output:
OK : FOO
OK : BAR
OK : *
KO : exam*ple
KO : example*
Source: https://stackoverflow.com/questions/21556526/how-to-use-word-break-asterisk-word-break-in-regex-with-perl