Strategies to handle a file with multiple fixed formats

后端 未结 6 906
北海茫月
北海茫月 2021-01-12 22:40

This question is not Perl-specific, (although the unpack function will most probably figure into my implementation).

I have to deal with files where multipl

6条回答
  •  暗喜
    暗喜 (楼主)
    2021-01-12 22:48

    Toying with an answer to your question, I arrived at an interesting solution with a concise main loop:

    while (<>) {
      given($_) {
        when (@{[ map $pattern{$_}, @expect]}) {}
        default {
          die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
        }
      }
    }
    

    As you'll see below, %pattern is a hash of named patterns for the different formats, and given/when against an array of Regex objects performs a short-circuiting search to find the first match.

    From this, you can infer that @expect is a list of names of formats we expect to find on the current line.

    For a while, I was stuck on the case of multiple possible expected formats and how to know format just matched, but then I remembered (?{ code }) in regular expressions:

    This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated.

    This allows something like a poor man's yacc grammar. For example, the pattern to match and process format 1 is

    fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
                 (?{ $device->{attr1} = $1;
                     @expect = qw< fmt2 >;
                   })
              /x,
    

    After processing the input from your question, $device contains

    {
      'attr1' => '109523.69142',
      'attr2' => '.981',
      'attr3' => '561A',
      'groups' => [
        {
          'date' => '10/MAY/2010',
          'nnn' => [ '24.15.30', '13.45.03' ],
          'records' => [
            [ '05:03:01', 'AB23X', '15.67', '101325.72', '14', '31.30474',  '13', '0' ],
            [ '05:03:15', 'CR22X', '16.72', '101325.42', '14', '29.16264',  '11', '0' ],
            [ '06:23:51', 'AW41X', '15.67', '101323.9',  '14', '31.264932', '19', '0' ],
          ],
        },
        {
          'date' => '11/MAY/2010',
          'nnn' => [ '24.07.13', '13.44.63' ],
          'records' => [
            [ '15:57:14', 'AB23X', '15.67', '101327.23', '14', '31.30474', '13', '0' ],
            [ '15:59:59', 'CR22X', '16.72', '101331.88', '14', '29.16264', '11', '0' ],
          ],
        }
      ],
    }
    

    I'm amused with the result, but for some reason Larry's advice in perlstyle comes to mind:

    Just because you CAN do something a particular way doesn't mean that you SHOULD do it that way.


    For completeness, a working program demonstrating the result is below.

    #! /usr/bin/perl
    
    use warnings;
    use strict;
    use feature ':5.10';
    use re 'eval';
    
    *ARGV = *DATA;
    
    my $device;
    my $record;
    my @expect = qw/ fmt1 /;
    my %pattern;
    %pattern = (
      fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
                   (?{ $device->{attr1} = $1;
                       @expect = qw< fmt2 >;
                     })
                /x,
    
      fmt2 => qr/^ \s* (\S+) \s+ (\S+) \s*$
                   (?{ @{$device}{qw< attr2 attr3 >} = ($1,$2);
                       @expect = qw< fmt3 >;
                     })
                /x,
    
      # e.g., 10/MAY/2010    24.15.30,13.45.03
      fmt3 => qr/^ (\d\d\/[A-Z]{3}\/\d{4}) \s+ (\S+) \s*$
                   (?{ my($date,$nnns) = ($1,$2);
                       push @{ $device->{groups} } =>
                         { nnn  => [ split m|,| => $nnns ],
                           date => $date };
                       @expect = qw< fmt4 >;
                     })
                /x,
    
      # e.g., 05:03:01   AB23X  15.67   101325.72
      fmt4 => qr/^ (\d\d:\d\d:\d\d) \s+
                   (\S+) \s+ (\S+) \s+ (\S+)
                   \s*$
                   (?{ push @{ $device->{groups}[-1]{records} } =>
                            [ $1, $2, $3, $4 ];
                       @expect = qw< fmt4 fmt5 >;
                     })
                /x,
    
      # e.g., *           14  31.30474 13        0
      fmt5 => qr/^\* \s+ (\d+) \s+
                  # tricky: possibly no whitespace after 9-char float
                  ((?=\d{1,7}\.\d+)[\d.]{1,9}) \s*
                  (\d+) \s+ (\d+)
                  \s*$
                  (?{ push @{ $device->{groups}[-1]{records}[-1] } =>
                            $1, $2, $3, $4;
                      @expect = qw< fmt4 fmt3 fmt2 >;
                    })
                /x,
    );
    
    while (<>) {
      given($_) {
        when (@{[ map $pattern{$_}, @expect]}) {}
        default {
          die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
        }
      }
    }
    
    use Data::Dumper;
    $Data::Dumper::Terse = $Data::Dumper::Indent = 1;
    print Dumper $device;
    
    __DATA__
    **DEVICE 109523.69142
      .981    561A
    10/MAY/2010    24.15.30,13.45.03
    05:03:01   AB23X  15.67   101325.72
    *           14  31.30474 13        0
    05:03:15   CR22X  16.72   101325.42
    *           14  29.16264 11        0
    06:23:51   AW41X  15.67    101323.9
    *           14  31.26493219        0
    11/MAY/2010    24.07.13,13.44.63
    15:57:14   AB23X  15.67   101327.23
    *           14  31.30474 13        0
    15:59:59   CR22X  16.72   101331.88
    *           14  29.16264 11        0
    

提交回复
热议问题