Matching multiple lines of poorly formatted text in Perl

Submitted by 一笑奈何 on 2019-12-08 12:56:28

UPDATE: I've changed the script to read from stdin and store the lines in the @output_lines array (to emulate @sureng's input situation).

I've wrapped the regex in a line accumulator that recognizes the time of day as the closing pattern. This way you can parse the output line by line and still apply the regex.

#!/usr/bin/perl

use strict;
use warnings;

my ($accumulator,$chat,$username,$chars,$timestamp);

my @output_lines = <STDIN>;

foreach (@output_lines)
{
    $accumulator .= $_;

    # Try to parse everything accumulated so far as one complete record.
    ($chat,$username,$chars,$timestamp) = $accumulator =~ m/(?im)^\s*(.+)\s+(\w+[-,\.]\w+)\s+(\d+)\s+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)\s*$/;
    $chat =~ s/\s+$// if $chat;  # remove trailing spaces

    if ( $accumulator =~ /(?i)([0-2]?\d:[0-5]?\d\s?[ap]m)/ ) {
        print "SECTION matched\n";
        print "-"x80,"\n";
        print "$accumulator";
        print "-"x80,"\n";
        print "chat -> ${chat}\n";
        print "username -> ${username}\n";
        print "chars -> ${chars}\n";
        print "timestamp -> ${timestamp}\n\n";
        $accumulator = '';  # reset the line accumulator
    }
}


In your shell, given the script above and this input file:

# MultiLineInput.txt
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                     Line2FirstName-LastName       8       7/17/2015 1:15 PM 
Line2Testing - 12323232323 Hello There

Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM

Hello Line4

                     Line4FirstName-LastName       8       9/17/2015 1:20 PM

You can simply call:

cat MultiLineInput.txt | perl StreamRegex.pl

If it works as expected, you can replace the cat command with your actual source.

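NB: this approach is needed if you process a stream, or if your file is bigger than the system's memory (so you have to process it as a stream); that said, it works in any case. Note that the script as written still reads all of STDIN into @output_lines first; for a real stream, loop over <STDIN> directly instead.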
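
For a true stream, a minimal sketch might read the input one line at a time rather than filling @output_lines first. The helper name parse_stream and its callback argument are my own assumptions, not part of the original script. Unlike the script above, this variant closes a record only when the full pattern matches, not on any time of day, so the accumulator keeps growing for as long as no record matches.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Same record pattern as the script above, compiled once.
my $record_re = qr/^\s*(.+)\s+(\w+[-,.]\w+)\s+(\d+)\s+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)\s*$/im;

# parse_stream is a hypothetical helper: it reads one line at a time from
# the given filehandle and calls $on_record for every completed record.
sub parse_stream {
    my ($fh, $on_record) = @_;
    my $accumulator = '';
    while (my $line = <$fh>) {
        $accumulator .= $line;
        # Emit only once the accumulated text forms a complete record.
        if (my ($chat, $username, $chars, $timestamp) = $accumulator =~ $record_re) {
            $chat =~ s/\s+$//;   # remove trailing spaces
            $on_record->($chat, $username, $chars, $timestamp);
            $accumulator = '';   # reset the line accumulator
        }
    }
    return;
}

# Only one line plus the current partial record is held in memory.
parse_stream(\*STDIN, sub {
    my ($chat, $username, $chars, $timestamp) = @_;
    print "chat -> $chat\n";
    print "username -> $username\n";
    print "chars -> $chars\n";
    print "timestamp -> $timestamp\n\n";
});
```

It is called the same way as before, e.g. cat MultiLineInput.txt | perl StreamRegex.pl, and any other filehandle can be passed in place of \*STDIN.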

It's best to use a single approach rather than switching on each line, because nothing indicates in advance whether a record will span one line or several. Since (int) and (date) have fixed formats, just use a multi-line regex pattern that matches something like this (pseudo-regex):

 \s+    (.*)   \s+    (.*)   (\d+)  (\d+\/\d+\/\d+ \d+\:\d+ [AP]M)$
 space  text   space  name   int    date

Don't forget the /m modifier, so that ^ and $ match at line boundaries. Because single-line and multi-line records are virtually identical apart from the \n and the extra spacing (both of which \s+ absorbs), the same pattern can be used in all cases.
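
To make the pseudo-regex concrete, here is a rough sketch of one way it could be written out. The helper name extract_records is hypothetical, and the exact sub-patterns are my own assumptions: a lazy match for the text, [\w.-]+ for the name, and a loose m/d/y h:mm AM/PM date. It assumes the whole input fits in memory as one string and that the text part sits on a single line.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# extract_records is a hypothetical helper: it applies one /m pattern to
# the whole text and returns a list of [text, name, int, date] array refs.
sub extract_records {
    my ($text) = @_;
    my @records;
    while ($text =~ m{
            ^[^\S\n]* (\S.*?)     # text: starts a line, taken lazily
            \s+ ([\w.-]+)         # name: may sit on a later line
            \s+ (\d+)             # int
            \s+ (\d+/\d+/\d+ \s+ \d+:\d+ \s* [AP]M)   # date
            [^\S\n]* $            # optional trailing blanks, end of line
        }gmix) {
        push @records, [$1, $2, $3, $4];
    }
    return @records;
}

my $sample = <<'END';
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                     Line2FirstName-LastName       8       7/17/2015 1:15 PM
END

for my $r (extract_records($sample)) {
    printf "text=%s | name=%s | int=%s | date=%s\n", @$r;
}
```

The /g flag walks through every record in one pass, /m lets ^ and $ anchor at each line, and \s+ between the fields is what lets a record continue across blank lines.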
