Matching multiple lines of poorly formatted text in Perl

Submitted by 点点圈 on 2019-12-08 07:53:01

Question


I receive data in the format below from an external program and need to extract the first four fields (text, username, number and timestamp) of each record. Note that "Hello Line1" is one field and the username is the second. A record may arrive on a single line (like Line1 below), spread over three lines (like Line2), or over two lines (like Line4). The layouts can also be mixed in the same output, as below; it is not always single-line, always double-line, and so on.

Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                         Line2FirstName-LastName       8       7/17/2015 1:15 PM 

Line2Testing - 12323232323 Hello There

Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM

Hello Line4

                         Line4FirstName-LastName       8       9/17/2015 1:20 PM


I was able to get a multi-line regex working with the help of this question: Perl multiline regex for first 3 individual items

Thanks to @GsusRecovery!

Since I am reading the output line by line, I don't think I can take advantage of the multi-line regex when reading a single line at a time. Is it possible to read only one line if the record is on one line, two lines if it is spread over two, or three lines if it is spread over three?

Or is it better to read every line and backtrack depending on whether the record uses the two-line or three-line layout?

Please suggest.


Answer 1:


UPDATE: I've changed the script to read from stdin and store the lines in the array @output_lines (to emulate the input situation of @sureng).

I've wrapped the regex in a line accumulator that recognizes the time of day (e.g. 2:46 PM) as the closing pattern of a record. This way you can parse the output line by line and still apply the regex.

#!/usr/bin/perl

use strict;
use warnings;

my $accumulator = '';

my @output_lines = <STDIN>;

foreach (@output_lines)
{
    $accumulator .= $_;

    # a time such as "2:46 PM" closes a record; keep accumulating until one appears
    next unless $accumulator =~ /(?i)[0-2]?\d:[0-5]?\d\s?[ap]m/;

    # the record is complete: split it into its four fields
    my ($chat,$username,$chars,$timestamp) = $accumulator =~ m/(?im)^\s*(.+)\s+(\w+[-,\.]\w+)\s+(\d+)\s+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)\s*$/;

    if (defined $timestamp) {
        $chat =~ s/\s+$//;  # remove trailing spaces
        print "SECTION matched\n";
        print "-"x80,"\n";
        print "$accumulator";
        print "-"x80,"\n";
        print "chat -> ${chat}\n";
        print "username -> ${username}\n";
        print "chars -> ${chars}\n";
        print "timestamp -> ${timestamp}\n\n";
    }
    else {
        # a time was seen but the fields did not parse; report instead of printing undef values
        warn "SECTION not matched:\n$accumulator";
    }
    $accumulator = '';  # reset the line accumulator
}

Try the solution online (with your example provided as stdin) here.

In your shell, given the script above and this input file:

# MultiLineInput.txt
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                     Line2FirstName-LastName       8       7/17/2015 1:15 PM 
Line2Testing - 12323232323 Hello There

Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM

Hello Line4

                     Line4FirstName-LastName       8       9/17/2015 1:20 PM

You can simply call:

cat MultiLineInput.txt | perl StreamRegex.pl

If it works as expected you can substitute the cat command with your source.

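NB: a line accumulator like this is needed when you process a stream, or when the input is larger than the system's memory (so it has to be processed as a stream); that said, it works in any case. Note that the script above still reads all of STDIN into @output_lines first; for true streaming, loop with while (<STDIN>) instead of building the array.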
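As a rough illustration of the streaming case mentioned in the note, the sketch below applies the same accumulator idea while reading one line at a time from any filehandle, so the whole input is never held in memory. The helper name parse_stream, the callback interface and the hash keys are assumptions made for this example and are not part of the original answer; the two patterns are the ones used in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Closing pattern: a time of day such as "2:46 PM" ends a record.
my $TIME_RE = qr/[0-2]?\d:[0-5]?\d\s?[ap]m/i;

# Full record pattern (same as in the script above): chat, username, chars, timestamp.
my $RECORD_RE = qr/^\s*(.+)\s+(\w+[-,.]\w+)\s+(\d+)\s+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)\s*$/im;

# Hypothetical helper: reads $fh line by line and calls $on_record
# with a hashref for every record that parses.
sub parse_stream {
    my ($fh, $on_record) = @_;
    my $accumulator = '';

    while (my $line = <$fh>) {
        $accumulator .= $line;
        next unless $accumulator =~ $TIME_RE;    # record not complete yet

        if (my ($chat, $username, $chars, $timestamp) = $accumulator =~ $RECORD_RE) {
            $chat =~ s/\s+$//;                   # remove trailing spaces
            $on_record->({
                chat      => $chat,
                username  => $username,
                chars     => $chars,
                timestamp => $timestamp,
            });
        }
        $accumulator = '';                       # start collecting the next record
    }
    return;
}

# Demo on an in-memory filehandle; pass \*STDIN instead to read a real stream.
my $sample = <<'END_SAMPLE';
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                     Line2FirstName-LastName       8       7/17/2015 1:15 PM
END_SAMPLE

open my $in, '<', \$sample or die "cannot open in-memory handle: $!";
parse_stream($in, sub {
    my ($rec) = @_;
    print join(' | ', @{$rec}{qw(chat username chars timestamp)}), "\n";
});
close $in;
```

Because each record is handed to the callback as soon as its closing time is seen, memory use stays bounded by the size of one record rather than the size of the input.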




Answer 2:


It's best to use a single approach rather than switching on each line, since nothing indicates beforehand whether a record is single-line or multi-line. Because the (int) and (date) fields have fixed formats, just use a multi-line regex pattern that matches something like this (pseudo-regex code):

 \s+    (.*)   \s+  (.*)  (\d+) (\d+\/\d+\/\d+ \d+\:\d+ [AP]M)$
 space text  space  name  int   date

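Don't forget the /m modifier, so that ^ and $ match at line boundaries inside a multi-line string; \s already matches newlines, so the spacing between fields can span lines. Because the single-line and multi-line layouts are virtually identical apart from the \n and extra spacing, the same pattern can be used in all cases.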
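One way to turn the pseudo-regex into something runnable is sketched below. It assumes the whole output is available as a single string and walks it with a global (/g) match. The exact pattern, the helper name parse_records and the hash keys are illustrative choices rather than the answerer's own code; in particular, the text field is matched lazily and confined to one line so that it stops at the first token that looks like a name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative pattern, with /x for readability and /m so that ^ and $ match per line.
# \h is horizontal whitespace (Perl 5.10+); \s also matches newlines,
# so the gaps between fields may span line breaks.
my $RECORD = qr{
    ^ \h*
    (\S.*?)         \s+     # text: lazy, stays on one line
    (\w+[-.]\w+)    \s+     # name: two words joined by "-" or "."
    (\d+)           \s+     # int
    (\d{1,2}/\d{1,2}/\d{4} \s+ \d{1,2}:\d{2} \s? [AP]M)   # date and time
    \h* $
}xm;

# Hypothetical helper: returns one hashref per record found in $text.
sub parse_records {
    my ($text) = @_;
    my @records;
    while ($text =~ /$RECORD/g) {
        push @records, { text => $1, name => $2, count => $3, date => $4 };
    }
    return @records;
}

# Demo with mixed single-line and multi-line records.
my $sample = <<'END_SAMPLE';
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                     Line2FirstName-LastName       8       7/17/2015 1:15 PM
Line2Testing - 12323232323 Hello There

Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM
END_SAMPLE

for my $rec (parse_records($sample)) {
    print join(' | ', @{$rec}{qw(text name count date)}), "\n";
}
```

Lines that do not start a valid record, such as the "Line2Testing" line, are simply skipped, because no name, int and date follow them in the required order.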



Source: https://stackoverflow.com/questions/33093262/matching-multiple-lines-of-poorly-formatted-text-in-perl
