How can I efficiently match many different regex patterns in Perl?

礼貌的吻别 2020-12-18 07:52

I have a growing list of regular expressions that I am using to parse through log files searching for "interesting" error and debug statements. I'm currently breaking th

8 answers
  • 2020-12-18 08:38

    You might want to take a look at Regexp::Assemble. It's intended to handle exactly this sort of problem.

    Code lifted from the module's synopsis:

    use Regexp::Assemble;
    
    my $ra = Regexp::Assemble->new;
    $ra->add( 'ab+c' );
    $ra->add( 'ab+-' );
    $ra->add( 'a\w\d+' );
    $ra->add( 'a\d+' );
    print $ra->re; # prints a(?:\w?\d+|b+[-c])
    

    You can even slurp your regex collection out of a separate file.
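
    As a rough sketch of that idea, the snippet below reads one pattern per line and combines them into a single regex. The pattern list (taken from the example expressions in this thread), the in-memory filehandle standing in for a real file, and the helper name is_interesting are all assumptions for illustration. If Regexp::Assemble is not installed, the sketch falls back to a plain alternation, which matches the same lines but lacks the module's prefix merging.

```perl
use strict;
use warnings;

# Hypothetical pattern list, one regex per line. In practice this would
# live in a real file opened with open(my $fh, '<', $path).
my $pattern_text = <<'END';
# interesting log lines
Failed in routing out
Agent .+ failed
Record Not Exist in DB
END

# An in-memory filehandle keeps the example self-contained.
open my $fh, '<', \$pattern_text or die "Cannot open pattern list: $!";
my @patterns;
while ( my $line = <$fh> ) {
    chomp $line;
    next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
    push @patterns, $line;
}
close $fh;

my $combined;
if ( eval { require Regexp::Assemble; 1 } ) {
    # Same add/re calls as in the synopsis above.
    my $ra = Regexp::Assemble->new;
    $ra->add($_) for @patterns;
    $combined = $ra->re;
}
else {
    # Fallback (not what Regexp::Assemble does): a simple alternation.
    my $alternation = join '|', map { "(?:$_)" } @patterns;
    $combined = qr/$alternation/;
}

# Hypothetical helper: true if the line matches any loaded pattern.
sub is_interesting {
    my $line = shift;
    return $line =~ $combined ? 1 : 0;
}

for my $line ( 'Agent FOO failed miserably', 'nothing to see here' ) {
    printf "%-28s %s\n", $line, is_interesting($line) ? 'match' : 'no match';
}
```

    Keeping the patterns in a separate file means the list can grow without touching the script; the combined regex is built once at startup and reused for every line.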

  • 2020-12-18 08:39

    Your example regular expressions look like they are based mainly on ordinary words and phrases. If that's the case, you might be able to speed things up considerably by pre-filtering the input lines using index, which is much faster than a regular expression. Under such a strategy, every regular expression would have a corresponding non-regex word or phrase for use in the pre-filtering stage. Better still would be to skip the regular expression test entirely, wherever possible: two of your example tests do not require regular expressions and could be done purely with index.

    Here is an illustration of the basic idea:

    use strict;
    use warnings;
    
    my @checks = (
        ['Failed',    qr/Failed in routing out/  ],
        ['failed',    qr/Agent .+ failed/        ],
        ['Not Exist', qr/Record Not Exist in DB/ ],
    );
    my @filter_strings = map { $_->[0] } @checks;
    my @regexes        = map { $_->[1] } @checks;
    
    sub regex {
        my $line = shift;
        for my $reg (@regexes){
            return 1 if $line =~ /$reg/;
        }
        return;
    }
    
    sub pre {
        my $line = shift;
        for my $fs (@filter_strings){
            return 1 if index($line, $fs) > -1;
        }
        return;
    }
    
    my @data = (
        qw(foo bar baz biz buz fubb),
        'Failed in routing out.....',
        'Agent FOO failed miserably',
        'McFly!!! Record Not Exist in DB',
    );
    
    use Benchmark qw(cmpthese);
    cmpthese ( -1, {
        regex     => sub { for (@data){ return $_ if(            regex($_)) } },
        prefilter => sub { for (@data){ return $_ if(pre($_) and regex($_)) } },
    } );
    

    Output (results with your data might be very different):

                 Rate     regex prefilter
    regex     36815/s        --      -54%
    prefilter 79331/s      115%        --
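
    The "skip the regular expression entirely" variant mentioned earlier could look roughly like the sketch below. It assumes that the two checks whose patterns contain no metacharacters can be treated as literal phrases; the table layout and the helper name first_match are hypothetical, not part of the benchmark above.

```perl
use strict;
use warnings;

# Each check is [ literal phrase, optional regex ].
# An undef regex means the index() test alone decides the match.
my @checks = (
    [ 'Failed in routing out',  undef               ],
    [ 'failed',                 qr/Agent .+ failed/ ],
    [ 'Record Not Exist in DB', undef               ],
);

# Hypothetical helper: returns the literal phrase of the first check
# that matches, or nothing if no check matches.
sub first_match {
    my $line = shift;
    for my $check (@checks) {
        my ( $literal, $regex ) = @$check;
        next if index( $line, $literal ) < 0;    # cheap substring test first
        return $literal if !defined $regex;      # a literal hit is enough
        return $literal if $line =~ $regex;      # confirm with the regex
    }
    return;
}

for my $line ( 'Failed in routing out.....', 'Agent FOO failed miserably', 'foo' ) {
    my $hit = first_match($line);
    printf "%-28s %s\n", $line, defined $hit ? "matched '$hit'" : 'no match';
}
```

    Pairing each regex with its own literal, rather than keeping two parallel lists, also means the regex only runs on lines that contain its own phrase, not on lines that passed some other filter string.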
    