Question
I have a complex precompiled regular expression in Perl. In most cases the regex is fine: it matches everything it should and nothing it shouldn't. There is one exception.
Basically my regex looks like:
my $regexp = qr/\b(FOO|BAR|\*)\b/;
Unfortunately, m/\b\*\b/ won't match the string "example, *". Only m/\*/ will, which I can't use because of false positives. Is there any workaround?
From the comments: the false positives are **, example* and exam*ple.
What is the regex intended for? It should extract keywords (one of which is a single asterisk) that coworkers have entered into product data. The goal is to move this information out of a free-text field into an atomic one.
Answer 1:
It sounds like you want to treat * as a word character. \b is equivalent to
(?x: (?<!\w)(?=\w) | (?<=\w)(?!\w) )
so you want
(?x: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
Applied, you get the following:
qr/
(?: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
(FOO|BAR|\*)
(?: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
/x
But given our knowledge of the middle expression, that can be simplified to the following:
qr/(?<![\w*])(FOO|BAR|\*)(?![\w*])/
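As a quick check of the simplified pattern, the following sketch runs it against the question's example and its three false positives. The helper name first_keyword and the extra sample string 'FOO bar' are assumptions made for illustration; they are not part of the original answer.

```perl
use strict;
use warnings;

# Simplified pattern from the answer: '*' is treated like a word character.
my $re = qr/(?<![\w*])(FOO|BAR|\*)(?![\w*])/;

# Hypothetical helper: returns the first keyword found, or undef if none.
sub first_keyword {
    my ($text) = @_;
    return $text =~ $re ? $1 : undef;
}

for my $text ('example, *', 'FOO bar', '**', 'example*', 'exam*ple') {
    my $hit = first_keyword($text);
    printf "%-12s => %s\n", $text, defined $hit ? "'$hit'" : 'no match';
}
```

Only the first two strings should report a keyword; the lookarounds reject any asterisk that touches a word character or another asterisk.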
Answer 2:
The problem is that Perl does not consider * to be a "word character", and thus does not recognize a word boundary between a space and an asterisk (whereas it does recognize one between the r and the * in foobar*).
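To make the boundary behaviour concrete, here is a small sketch that tries the question's \b\*\b pattern on three strings. The helper name star_between_boundaries is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text matches the question's \b\*\b pattern.
sub star_between_boundaries {
    my ($text) = @_;
    return $text =~ /\b\*\b/ ? 1 : 0;
}

for my $text ('example, *', 'foobar*', 'exam*ple') {
    printf "%-12s => %s\n", $text,
        star_between_boundaries($text) ? 'match' : 'no match';
}
```

Because \b needs a word character on exactly one side, \b\*\b can only succeed when the asterisk has a word character on both sides, as in exam*ple. That is the opposite of what the question wants.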
The solution is to first decide what you do want to consider "word" and "non-word" characters, and then check for that explicitly. For example, if you want your words to consist only of the letters 'A' to 'Z' (or their lowercase versions) and *, and for everything else to be treated as non-word characters, you can use:
/(?<![A-Za-z*])(FOO|BAR|\*)(?![A-Za-z*])/
This will match the strings FOO, BAR or *, provided that they're not preceded or followed by a character that matches [A-Za-z*].
Similarly, if you want to treat everything except whitespace as a word character, you could use:
/(?<!\S)(FOO|BAR|\*)(?!\S)/
which will match FOO, BAR or *, provided that they're not preceded or followed by a non-whitespace character.
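The two patterns differ on input with punctuation or digits next to a keyword. The sketch below compares them; the helper keywords, the variable names and the sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# The two patterns from this answer.
my $letters_re    = qr/(?<![A-Za-z*])(FOO|BAR|\*)(?![A-Za-z*])/;
my $whitespace_re = qr/(?<!\S)(FOO|BAR|\*)(?!\S)/;

# Hypothetical helper: returns every keyword the given pattern finds.
sub keywords {
    my ($re, $text) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, $1;
    }
    return @found;
}

for my $text ('FOO *', 'FOO, *', '(BAR)', 'FOO2') {
    printf "%-8s letters: [%s]  whitespace: [%s]\n", $text,
        join(' ', keywords($letters_re,    $text)),
        join(' ', keywords($whitespace_re, $text));
}
```

The letter-based pattern accepts a keyword next to a comma, parenthesis or digit, while the whitespace-based pattern requires the keyword to stand alone between spaces or string edges. Which one fits depends on how the free-text field is actually delimited.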
Answer 3:
How about:
my $regexp = qr/(?:\b(FOO|BAR)\b)|(?:^| )\*(?:$| )/;
In action:
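use strict;
use warnings;
use feature 'say';

my $re = qr~(?:\b(FOO|BAR)\b)|(?:^| )\*(?:$| )~;
while (<DATA>) {
    chomp;
    # Print OK for lines that contain a keyword, KO otherwise.
    say( /$re/ ? "OK : $_" : "KO : $_" );
}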
__DATA__
FOO
BAR
*
exam*ple
example*
Output:
OK : FOO
OK : BAR
OK : *
KO : exam*ple
KO : example*
Source: https://stackoverflow.com/questions/21556526/how-to-use-word-break-asterisk-word-break-in-regex-with-perl