How to use word break, asterisk, word break in Regex with Perl?

Submitted by 蹲街弑〆低调 on 2021-02-04 22:21:12

Question


I have a complex precompiled regular expression in Perl. In most cases the regex is fine: it matches everything it should and nothing it shouldn't. There is one exception.

Basically my regex looks like:

my $regexp = qr/\b(FOO|BAR|\*)\b/;

Unfortunately, m/\b\*\b/ won't match the asterisk in a string such as "example, *". Only m/\*/ will, and I can't use that because of false positives. Is there any workaround?

From the comments: the false positives are **, example* and exam*ple.

What is the regex intended for? It should extract keywords (one of which is a single asterisk) that coworkers have entered into product data. The goal is to move this information out of a free-text field into an atomic one.


Answer 1:


It sounds like you want to treat * as a word character.

\b

is equivalent to

(?x: (?<!\w)(?=\w) | (?<=\w)(?!\w) )

so you want

(?x: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
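To make the difference concrete, the following sketch lists the offsets at which each boundary definition matches in the string "example, *" from the question. The helper boundary_positions is a hypothetical name introduced only for this illustration.

```perl
use strict;
use warnings;

# Built-in word boundary, its lookaround expansion, and the widened variant
# that treats '*' as a word character.
my $builtin  = qr/\b/;
my $expanded = qr/(?x: (?<!\w)(?=\w) | (?<=\w)(?!\w) )/;
my $widened  = qr/(?x: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )/;

# Hypothetical helper: offsets of every zero-width match of $re in $text.
sub boundary_positions {
    my ($text, $re) = @_;
    my @offsets;
    while ($text =~ /$re/g) {
        push @offsets, pos($text);
    }
    return @offsets;
}

my $text = 'example, *';
print "builtin:  @{[ boundary_positions($text, $builtin) ]}\n";
print "expanded: @{[ boundary_positions($text, $expanded) ]}\n";
print "widened:  @{[ boundary_positions($text, $widened) ]}\n";
```

The first two lines should report the same offsets (0 and 7, on either side of "example"), while the widened version additionally reports 9 and 10, on either side of the asterisk.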

Applied, you get the following:

qr/
    (?: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
    (FOO|BAR|\*)
    (?: (?<![\w*])(?=[\w*]) | (?<=[\w*])(?![\w*]) )
/x

But the middle expression always starts and ends with a character in [\w*], so only one lookaround is needed on each side, and that can be simplified to the following:

qr/(?<![\w*])(FOO|BAR|\*)(?![\w*])/
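As a quick check, the simplified pattern can be run against the strings mentioned in the question. This is a minimal sketch; extract_keywords is a hypothetical helper name, and the sample strings other than the ones quoted in the question are made up for illustration.

```perl
use strict;
use warnings;

my $regexp = qr/(?<![\w*])(FOO|BAR|\*)(?![\w*])/;

# Hypothetical helper: in list context, returns every keyword captured in $text.
sub extract_keywords {
    my ($text) = @_;
    return $text =~ /$regexp/g;
}

# 'example, *' should match; '**', 'example*' and 'exam*ple' are the
# false positives from the question and should yield nothing.
for my $text ('example, *', 'FOO * BAR', '**', 'example*', 'exam*ple') {
    my @found = extract_keywords($text);
    printf "'%s' => %s\n", $text, @found ? join(' ', @found) : '(none)';
}
```

Only the first two strings should produce keywords: a single * for "example, *", and FOO, * and BAR for "FOO * BAR".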



Answer 2:


The problem is that Perl does not consider * to be a "word character", and thus does not recognize a word boundary between a space and an asterisk (whereas it does recognize one between the r and the * in foobar*).
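A small sketch of that behaviour, using sample strings taken from the question plus foobar*; the helper name star_between_boundaries is hypothetical.

```perl
use strict;
use warnings;

# \b requires a word character (\w) on exactly one side of the position.
# Since '*' is not in \w, \b\*\b can only succeed when '*' has a word
# character directly before it AND directly after it.
sub star_between_boundaries {
    my ($text) = @_;
    return $text =~ /\b\*\b/ ? 1 : 0;
}

for my $text ('example, *', 'foobar*', 'exam*ple', '*') {
    printf "'%s' => %s\n", $text,
        star_between_boundaries($text) ? 'match' : 'no match';
}
```

Of these four strings, only exam*ple should match, which is exactly the opposite of what the question needs.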

The solution is to first decide what you do want to consider "word" and "non-word" characters, and then check for that explicitly. For example, if you want your words to consist only of letters 'A' to 'Z' (or their lowercase versions) and *, and for everything else to be treated as non-word characters, you can use:

/(?<![A-Za-z*])(FOO|BAR|\*)(?![A-Za-z*])/

This will match the strings FOO, BAR or *, provided that they're not preceded or followed by a character that matches [A-Za-z*].

Similarly, if you, say, want to consider everything except whitespace as word characters, you could use:

/(?<!\S)(FOO|BAR|\*)(?!\S)/

which will match FOO, BAR or *, provided that they're not preceded or followed by a non-whitespace character.
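The two variants differ once punctuation or digits sit next to a keyword. The sketch below compares them on a few made-up inputs; the helper names keywords_by_letters and keywords_by_whitespace, and all sample strings except "example, *" and "example*", are assumptions for illustration.

```perl
use strict;
use warnings;

my $letters_re    = qr/(?<![A-Za-z*])(FOO|BAR|\*)(?![A-Za-z*])/;
my $whitespace_re = qr/(?<!\S)(FOO|BAR|\*)(?!\S)/;

# Hypothetical helpers: return the keywords each pattern finds in $text.
sub keywords_by_letters {
    my ($text) = @_;
    my @found = $text =~ /$letters_re/g;
    return @found;
}

sub keywords_by_whitespace {
    my ($text) = @_;
    my @found = $text =~ /$whitespace_re/g;
    return @found;
}

for my $text ('example, *', 'FOO-BAR', 'FOO2', '(*)', 'example*') {
    my @by_letters    = keywords_by_letters($text);
    my @by_whitespace = keywords_by_whitespace($text);
    printf "'%s': letters=[%s] whitespace=[%s]\n",
        $text, "@by_letters", "@by_whitespace";
}
```

Both patterns should find the asterisk in "example, *" and nothing in "example*". The letter-based pattern should also accept FOO-BAR, FOO2 and (*), because hyphens, digits and parentheses count as separators there, whereas the whitespace-based pattern should reject all three.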




Answer 3:


How about:

my $regexp = qr/(?:\b(FOO|BAR)\b)|(?:^| )\*(?:$| )/;

In action:

use strict;
use warnings;
use feature 'say';    # say() is not enabled by default

my $re = qr~(?:\b(FOO|BAR)\b)|(?:^| )\*(?:$| )~;
while (<DATA>) {
    chomp;
    say(/$re/ ? "OK : $_" : "KO : $_");
}


__DATA__
FOO
BAR
*
exam*ple
example*

Output:

OK : FOO
OK : BAR
OK : *
KO : exam*ple
KO : example*
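Note that this pattern only reports whether a line matches: the asterisk branch has no capture group, so $1 stays undefined when * is the keyword. If the keyword itself is needed, one possible adjustment (an assumption, not part of the answer) is to add a second capture group and read whichever group is defined. The helper name keyword_in is hypothetical, and the // operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Same alternation as in the answer, plus a capture group around the asterisk.
my $re = qr/\b(FOO|BAR)\b|(?:^| )(\*)(?:$| )/;

# Hypothetical helper: returns the first keyword in $text, or undef if none.
sub keyword_in {
    my ($text) = @_;
    return $text =~ $re ? ($1 // $2) : undef;
}

for my $text ('FOO', 'example, *', 'exam*ple') {
    my $keyword = keyword_in($text);
    printf "'%s' => %s\n", $text, defined $keyword ? $keyword : '(none)';
}
```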


Source: https://stackoverflow.com/questions/21556526/how-to-use-word-break-asterisk-word-break-in-regex-with-perl
