How can I efficiently match many different regex patterns in Perl?

I have a growing list of regular expressions that I am using to parse through log files searching for "interesting" error and debug statements. I'm currently breaking th

8 answers

礼貌的吻别

2020-12-18 08:38
You might want to take a look at Regexp::Assemble. It's intended to handle exactly this sort of problem.

Code borrowed from the module's synopsis:
```
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+-' );
$ra->add( 'a\w\d+' );
$ra->add( 'a\d+' );
print $ra->re; # prints a(?:\w?\d+|b+[-c])
```
You can even slurp your regex collection out of a separate file.
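
As a rough sketch of that idea, the snippet below reads one pattern per line from a filehandle and builds a single matcher from them. The pattern list, the in-memory "file", and the helper names `load_patterns` and `build_matcher` are assumptions made for illustration. Because Regexp::Assemble is not a core module, the sketch falls back to a plain, unoptimized alternation when it is not installed.

```perl
use strict;
use warnings;

# Hypothetical pattern list, one regex per line. In practice this would be
# a real file opened with open(my $fh, '<', $path).
my $pattern_text = <<'END';
Failed in routing out
Agent .+ failed
Record Not Exist in DB
END

# Read non-blank lines from a filehandle and return them as patterns.
sub load_patterns {
    my ($fh) = @_;
    my @patterns;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank lines
        push @patterns, $line;
    }
    return @patterns;
}

# Combine the patterns into one compiled regex.
sub build_matcher {
    my @patterns = @_;

    # Prefer Regexp::Assemble when it is available.
    if (eval { require Regexp::Assemble; 1 }) {
        my $ra = Regexp::Assemble->new;
        $ra->add($_) for @patterns;
        return $ra->re;
    }

    # Fallback: a plain alternation, with no prefix merging.
    my $alternation = join '|', map { "(?:$_)" } @patterns;
    return qr/$alternation/;
}

# Open the string as an in-memory file so the sketch is self-contained.
open my $fh, '<', \$pattern_text or die "Cannot open in-memory file: $!";
my $matcher = build_matcher(load_patterns($fh));
close $fh;

for my $line ('Agent FOO failed miserably', 'all quiet') {
    printf "%-28s %s\n", $line, ($line =~ $matcher ? 'match' : 'no match');
}
```

Keeping the patterns in a separate file means the list can grow without touching the script; the matcher is compiled once at startup and reused for every log line.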

执念已碎

2020-12-18 08:39

Your example regular expressions look like they are based mainly on ordinary words and phrases. If that's the case, you might be able to speed things up considerably by pre-filtering the input lines with the built-in index function, which is typically much faster than a regex match when looking for a fixed substring. Under such a strategy, every regular expression would have a corresponding non-regex word or phrase for use in the pre-filtering stage. Better still would be to skip the regular expression test entirely wherever possible: two of your example tests do not require regular expressions and could be done purely with index.

Here is an illustration of the basic idea:

```
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Each check pairs a literal pre-filter string with the full regex.
my @checks = (
    ['Failed',    qr/Failed in routing out/  ],
    ['failed',    qr/Agent .+ failed/        ],
    ['Not Exist', qr/Record Not Exist in DB/ ],
);
my @filter_strings = map { $_->[0] } @checks;
my @regexes        = map { $_->[1] } @checks;

# True if any of the regexes matches the line.
sub regex {
    my $line = shift;
    for my $reg (@regexes){
        return 1 if $line =~ /$reg/;
    }
    return;
}

# True if any of the literal filter strings occurs in the line.
sub pre {
    my $line = shift;
    for my $fs (@filter_strings){
        return 1 if index($line, $fs) > -1;
    }
    return;
}

my @data = (
    qw(foo bar baz biz buz fubb),
    'Failed in routing out.....',
    'Agent FOO failed miserably',
    'McFly!!! Record Not Exist in DB',
);

# Compare regex-only matching against index pre-filtering followed by regex.
cmpthese ( -1, {
    regex     => sub { for (@data){ return $_ if(            regex($_)) } },
    prefilter => sub { for (@data){ return $_ if(pre($_) and regex($_)) } },
} );
```

Output (results with your data might be very different):

```
             Rate     regex prefilter
regex     36815/s        --      -54%
prefilter 79331/s      115%        --
```
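
The suggestion to skip the regex entirely where a literal phrase is enough could look roughly like the following. This is only a sketch under assumed names: the hash layout of `@checks` and the function `is_interesting` are hypothetical, and each literal is paired with its own regex rather than testing all filters and then all regexes.

```perl
use strict;
use warnings;

# Each check has a literal string and an optional regex.
# A missing regex means the literal match alone is sufficient.
my @checks = (
    { literal => 'Failed in routing out' },
    { literal => 'failed', regex => qr/Agent .+ failed/ },
    { literal => 'Record Not Exist in DB' },
);

sub is_interesting {
    my ($line) = @_;
    for my $check (@checks) {
        # Cheap rejection: the literal must occur somewhere in the line.
        next if index($line, $check->{literal}) < 0;

        # Literal-only check: no regex needed.
        return 1 unless defined $check->{regex};

        # Run the regex only on lines that passed the literal test.
        return 1 if $line =~ $check->{regex};
    }
    return 0;
}

my @data = (
    qw(foo bar baz biz buz fubb),
    'Failed in routing out.....',
    'Agent FOO failed miserably',
    'McFly!!! Record Not Exist in DB',
);

print "$_\n" for grep { is_interesting($_) } @data;
```

Whether this beats the combined approach depends on how often the literals occur in real log lines, so it is worth benchmarking against your own data.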
