I have a growing list of regular expressions that I am using to parse through log files searching for "interesting" error and debug statements. I'm currently breaking th
From perlfaq6's answer to How do I efficiently match many regular expressions at once?
(contributed by brian d foy)
Avoid asking Perl to compile a regular expression every time you want to match it. In this example, perl must recompile the regular expression for every iteration of the foreach loop since it has no way to know what $pattern will be.
@patterns = qw( foo bar baz );
LINE: while( <DATA> )
{
    foreach $pattern ( @patterns )
    {
        if( /\b$pattern\b/i )
        {
            print;
            next LINE;
        }
    }
}
The qr// operator showed up in perl 5.005. It compiles a regular expression, but doesn't apply it. When you use the pre-compiled version of the regex, perl does less work. In this example, I inserted a map to turn each pattern into its pre-compiled form. The rest of the script is the same, but faster.
@patterns = map { qr/\b$_\b/i } qw( foo bar baz );
LINE: while( <> )
{
    foreach $pattern ( @patterns )
    {
        if( /$pattern/ )
        {
            print;
            next LINE;
        }
    }
}
In some cases, you may be able to make several patterns into a single regular expression. Beware of situations that require backtracking though.
$regex = join '|', qw( foo bar baz );
LINE: while( <> )
{
    print if /\b(?:$regex)\b/i;
}
For more details on regular expression efficiency, see Mastering Regular Expressions by Jeffrey Friedl. He explains how regular expression engines work and why some patterns are surprisingly inefficient. Once you understand how perl applies regular expressions, you can tune them for individual situations.
You can combine your regexes with the alternation operator |, as in: /pattern1|pattern2|pattern3/
Obviously, it won't be very maintainable if you put all of them in a single line, but you've got options to mitigate that.
You can use the /x regex modifier to space them nicely, one per line. A word of caution if you choose this direction: you'll have to explicitly specify the space characters you expect, otherwise they'll be ignored because of the /x.
You can generate your regular expression at run-time, by combining individual sources. Something like this (untested):
use feature 'say';    # say requires Perl 5.10 or later
my $regex = join '|', @sources;
while (<>) {
    next unless /$regex/o;
    say;
}
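As a rough sketch of the /x approach described above, the alternation can be laid out one pattern per line. The patterns and sample lines here are only illustrative; the point to notice is that each literal space is escaped, since /x would otherwise discard it.

```perl
use strict;
use warnings;

# Example patterns only. Under /x, unescaped whitespace in the pattern
# is ignored, so each literal space is written as "\ ".
my $interesting = qr/
      Failed\ in\ routing\ out
    | Agent\ .+\ failed
    | Record\ Not\ Exist\ in\ DB
/x;

# Hypothetical sample lines, for illustration.
my @lines = (
    "12:00 Failed in routing out",
    "12:01 all good",
    "12:02 Agent 007 failed",
);

# Keep only the lines that match one of the alternatives.
my @hits = grep { $_ =~ $interesting } @lines;
print "$_\n" for @hits;
```

Adding a new pattern then means adding one line to the qr// block, which keeps diffs small.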
This is handled easily with Perl 5.10, although smartmatch and given/when were later marked experimental (in Perl 5.18):
use strict;
use warnings;
use 5.10.1;
my @matches = (
    qr'Failed in routing out',
    qr'Agent .+ failed',
    qr'Record Not Exist in DB'
);
# ...
sub parse{
    my($filename) = @_;
    open my $file, '<', $filename
        or die "Cannot open $filename: $!";
    while( my $line = <$file> ){
        chomp $line;
        # you could use given/when
        given( $line ){
            when( @matches ){
                #...
            }
        }
        # or smartmatch
        if( $line ~~ @matches ){
            # ...
        }
    }
}
You could use the new Smart-Match operator ~~.
if( $line ~~ @matches ){ ... }
Or you can use given/when, which performs the same as using the Smart-Match operator.
given( $line ){
    when( @matches ){
        #...
    }
}
You might want to get rid of the large if statement:
my @interesting = (
    qr/Failed in routing out/,
    qr/Agent .+ failed/,
    qr/Record Not Exist in DB/,
);
# bail out unless at least one pattern matches
return unless grep { $line =~ $_ } @interesting;
although I cannot promise this will improve anything without benchmarking with real data.
It might help if you can anchor your patterns at the beginning so they can fail more quickly.
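A minimal sketch of what anchoring could look like, assuming (hypothetically) that every log line starts with a level tag such as "ERROR: ". The log format and the is_interesting helper are made up for illustration; adapt the prefix to whatever your lines actually begin with.

```perl
use strict;
use warnings;

# Assumed (hypothetical) log format: "LEVEL: message".
# The leading ^ lets a non-matching line be rejected at position 0
# instead of being tried at every offset.
my @interesting = (
    qr/^ERROR: Failed in routing out/,
    qr/^ERROR: Agent .+ failed/,
    qr/^ERROR: Record Not Exist in DB/,
);

# Hypothetical helper: true if any anchored pattern matches the line.
sub is_interesting {
    my ($line) = @_;
    for my $re (@interesting) {
        return 1 if $line =~ $re;
    }
    return 0;
}

print is_interesting("ERROR: Agent 7 failed")        ? "match\n" : "no match\n";
print is_interesting("DEBUG: ERROR: Agent 7 failed") ? "match\n" : "no match\n";
```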
One possible solution is to let the regex state machine do the checking of alternatives for you. You'll have to benchmark to see if the result is noticeably more efficient, but it will certainly be more maintainable.
First, you'd maintain a file containing one pattern of interest per line.
Failed in routing out
Agent .+ failed
Record Not Exist in DB
Then you'd read in that file at the beginning of your run, and construct a large regular expression using the alternation operator, "|":
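open(my $patterns_fh, '<', 'foo.txt') or die $!;
chomp(my @patterns = <$patterns_fh>);    # one pattern per line
close $patterns_fh or die $!;
my $matcher = join('|', @patterns);
$matcher = qr/$matcher/;                 # compile once, up front
while (<MYLOG>) {                        # MYLOG: your already-opened log filehandle
    print if $_ =~ $matcher;
}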
Maybe something like:
my @interesting = (
    qr/Failed in routing out/,
    qr/Agent .+ failed/,
    qr/Record Not Exist in DB/,
);
...
for my $re (@interesting) {
    if ($line =~ /$re/) {
        print $line;
        last;
    }
}
You can try joining all your patterns with "|" to make one regex. That may or may not be faster.
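To find out which is faster for your data, something along these lines with the core Benchmark module might do. This is only a sketch: the sample lines are made up, and the count_loop and count_single names are just for illustration, so substitute lines from a real log before trusting the numbers.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @sources = ('Failed in routing out', 'Agent .+ failed', 'Record Not Exist in DB');
my @each    = map { qr/$_/ } @sources;    # one compiled regex per pattern
my $joined  = join '|', @sources;
my $single  = qr/$joined/;                # all patterns as one alternation

# Made-up sample data; replace with lines from a real log.
my @lines = (
    'Agent 12 failed',
    'nothing to see here',
    'Record Not Exist in DB',
) x 1000;

# Variant 1: try each compiled pattern in turn, stop at the first match.
sub count_loop {
    my $n = 0;
    LINE: for my $line (@lines) {
        for my $re (@each) {
            if ($line =~ $re) { $n++; next LINE }
        }
    }
    return $n;
}

# Variant 2: a single match against the joined alternation.
sub count_single {
    my $n = 0;
    for my $line (@lines) { $n++ if $line =~ $single }
    return $n;
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(-1, { loop => \&count_loop, single => \&count_single });
```

Both subs return the number of matching lines, which makes it easy to check that the two variants agree before comparing their speed.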