Strategies to handle a file with multiple fixed formats

后端未结

关注

 6  906

北海茫月 2021-01-12 22:40

This question is not Perl-specific, (although the unpack function will most probably figure into my implementation).

I have to deal with files where multipl

6条回答

暗喜 (楼主)

2021-01-12 22:48

Toying with an answer to your question, I arrived at an interesting solution with a concise main loop:

while (<>) {
  given($_) {
    when (@{[ map $pattern{$_}, @expect]}) {}
    default {
      die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
    }
  }
}

As you'll see below, %pattern is a hash of named patterns for the different formats, and given/when against an array of Regex objects performs a short-circuiting search to find the first match.

From this, you can infer that @expect is a list of names of formats we expect to find on the current line.

For a while, I was stuck on the case of multiple possible expected formats and how to know format just matched, but then I remembered (?{ code }) in regular expressions:

This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated.

This allows something like a poor man's yacc grammar. For example, the pattern to match and process format 1 is

fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
             (?{ $device->{attr1} = $1;
                 @expect = qw< fmt2 >;
               })
          /x,

After processing the input from your question, $device contains

{
  'attr1' => '109523.69142',
  'attr2' => '.981',
  'attr3' => '561A',
  'groups' => [
    {
      'date' => '10/MAY/2010',
      'nnn' => [ '24.15.30', '13.45.03' ],
      'records' => [
        [ '05:03:01', 'AB23X', '15.67', '101325.72', '14', '31.30474',  '13', '0' ],
        [ '05:03:15', 'CR22X', '16.72', '101325.42', '14', '29.16264',  '11', '0' ],
        [ '06:23:51', 'AW41X', '15.67', '101323.9',  '14', '31.264932', '19', '0' ],
      ],
    },
    {
      'date' => '11/MAY/2010',
      'nnn' => [ '24.07.13', '13.44.63' ],
      'records' => [
        [ '15:57:14', 'AB23X', '15.67', '101327.23', '14', '31.30474', '13', '0' ],
        [ '15:59:59', 'CR22X', '16.72', '101331.88', '14', '29.16264', '11', '0' ],
      ],
    }
  ],
}

I'm amused with the result, but for some reason Larry's advice in perlstyle comes to mind:

Just because you CAN do something a particular way doesn't mean that you SHOULD do it that way.

For completeness, a working program demonstrating the result is below.

#! /usr/bin/perl

use warnings;
use strict;
use feature ':5.10';
use re 'eval';

*ARGV = *DATA;

my $device;
my $record;
my @expect = qw/ fmt1 /;
my %pattern;
%pattern = (
  fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
               (?{ $device->{attr1} = $1;
                   @expect = qw< fmt2 >;
                 })
            /x,

  fmt2 => qr/^ \s* (\S+) \s+ (\S+) \s*$
               (?{ @{$device}{qw< attr2 attr3 >} = ($1,$2);
                   @expect = qw< fmt3 >;
                 })
            /x,

  # e.g., 10/MAY/2010    24.15.30,13.45.03
  fmt3 => qr/^ (\d\d\/[A-Z]{3}\/\d{4}) \s+ (\S+) \s*$
               (?{ my($date,$nnns) = ($1,$2);
                   push @{ $device->{groups} } =>
                     { nnn  => [ split m|,| => $nnns ],
                       date => $date };
                   @expect = qw< fmt4 >;
                 })
            /x,

  # e.g., 05:03:01   AB23X  15.67   101325.72
  fmt4 => qr/^ (\d\d:\d\d:\d\d) \s+
               (\S+) \s+ (\S+) \s+ (\S+)
               \s*$
               (?{ push @{ $device->{groups}[-1]{records} } =>
                        [ $1, $2, $3, $4 ];
                   @expect = qw< fmt4 fmt5 >;
                 })
            /x,

  # e.g., *           14  31.30474 13        0
  fmt5 => qr/^\* \s+ (\d+) \s+
              # tricky: possibly no whitespace after 9-char float
              ((?=\d{1,7}\.\d+)[\d.]{1,9}) \s*
              (\d+) \s+ (\d+)
              \s*$
              (?{ push @{ $device->{groups}[-1]{records}[-1] } =>
                        $1, $2, $3, $4;
                  @expect = qw< fmt4 fmt3 fmt2 >;
                })
            /x,
);

while (<>) {
  given($_) {
    when (@{[ map $pattern{$_}, @expect]}) {}
    default {
      die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
    }
  }
}

use Data::Dumper;
$Data::Dumper::Terse = $Data::Dumper::Indent = 1;
print Dumper $device;

__DATA__
**DEVICE 109523.69142
  .981    561A
10/MAY/2010    24.15.30,13.45.03
05:03:01   AB23X  15.67   101325.72
*           14  31.30474 13        0
05:03:15   CR22X  16.72   101325.42
*           14  29.16264 11        0
06:23:51   AW41X  15.67    101323.9
*           14  31.26493219        0
11/MAY/2010    24.07.13,13.44.63
15:57:14   AB23X  15.67   101327.23
*           14  31.30474 13        0
15:59:59   CR22X  16.72   101331.88
*           14  29.16264 11        0

0 讨论(0)

查看其它6个回答