Strategies to handle a file with multiple fixed formats

北海茫月 2021-01-12 22:40

This question is not Perl-specific (although the unpack function will most probably figure into my implementation).

I have to deal with files where multipl

6 answers
  •  感动是毒
    2021-01-12 23:15

    This is a good question. Two suggestions occur to me.

    (1) The first is simply to reiterate the idea from cjm: an object-based state machine. This is a flexible way to perform complex parsing. I've used it several times and have been happy with the results in most cases.
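
    As a rough illustration of the state-machine idea, here is a minimal sketch. The class name RecordParser, the state names, and the hash keys are all hypothetical, and the line patterns are guesses based on the sample output shown further down, not a definitive description of the format. Each state is a method that consumes one (already unbroken) line and returns the name of the next state.

```perl
use strict;
use warnings;

# Hypothetical parser: each state is a method that consumes one line
# and returns the name of the next state.
package RecordParser;

sub new {
    my ($class) = @_;
    return bless { state => 'expect_header', date => undef, readings => [] }, $class;
}

# Look up the handler for the current state and let it pick the next state.
sub feed {
    my ($self, $line) = @_;
    my $handler = $self->can( $self->{state} )
        or die "Unknown state '$self->{state}'\n";
    $self->{state} = $self->$handler($line);
    return $self;
}

sub expect_header {
    my ($self, $line) = @_;
    $line =~ /^\*/ or die "Expected a header line, got: $line\n";
    $self->{header} = $line;
    return 'expect_date';
}

sub expect_date {
    my ($self, $line) = @_;
    $line =~ m{^(\d\d/[A-Z]{3}/\d{4})} or die "Expected a date line, got: $line\n";
    $self->{date} = $1;
    return 'in_readings';
}

sub in_readings {
    my ($self, $line) = @_;
    # A date line closes the current group and opens the next one.
    return $self->expect_date($line) if $line =~ m{^\d\d/};
    my ($time, $id) = split ' ', $line;
    push @{ $self->{readings} }, { date => $self->{date}, time => $time, id => $id };
    return 'in_readings';
}

package main;

my $parser = RecordParser->new;
$parser->feed($_) for (
    '**DEVICE 109523.69142   .981    561A',
    '10/MAY/2010    24.15.30,13.45.03',
    '05:03:01   AB23X  15.67   101325.72 *           14  31.30474 13        0',
    '11/MAY/2010    24.07.13,13.44.63',
    '15:57:14   AB23X  15.67   101327.23 *           14  31.30474 13        0',
);
print "$_->{date} $_->{time} $_->{id}\n" for @{ $parser->{readings} };
```

    Keeping the current date in the object is what lets each reading be tagged with the date line that preceded it; supporting a new line format would mean adding one more state method.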

    (2) The second idea falls under the category of divide and conquer: a Unix pipeline to pre-process the data.

    First, an observation about your data: if a set of formats always occurs as a pair, it effectively represents a single data format and can be combined without any loss of information. This means that you have only 3 formats: 1+2, 3, and 4+5.

    And that thought leads to the strategy. Write a very simple script or two to pre-process your data -- effectively, a reformatting step to get the data into shape before the real parsing work begins. Here I show the scripts as separate tools. They could be combined, but the general philosophy might suggest that they remain distinct, narrowly defined tools.

    In unbreak_records.pl (omitting the shebang and use strict/warnings):

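    # Join continuation lines onto the preceding record.
    while (<>){
        chomp;
        # A line starting with whitespace (optionally after a single '*')
        # continues the previous record; anything else starts a new one.
        print /^\*?\s/ ? ' ' : "\n", $_;
    }
    print "\n";   # terminate the final record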
    

    In add_record_types.pl:

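    while (<>){
        next unless /\S/;   # skip blank lines
        # Record type: 1 if the line starts with '*',
        # 2 if its third character is '/', otherwise 3.
        my $rt = /^\*/ ?   1 :
                 /^..\// ? 2 : 3;
        print $rt, ' ', $_;
    }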
    

    At the command line:

    ./unbreak_records.pl orig.dat | ./add_record_types.pl > reformatted.dat
    

    Output:

    1 **DEVICE 109523.69142   .981    561A
    2 10/MAY/2010    24.15.30,13.45.03
    3 05:03:01   AB23X  15.67   101325.72 *           14  31.30474 13        0
    3 05:03:15   CR22X  16.72   101325.42 *           14  29.16264 11        0
    3 06:23:51   AW41X  15.67    101323.9 *           14  31.26493219        0
    2 11/MAY/2010    24.07.13,13.44.63
    3 15:57:14   AB23X  15.67   101327.23 *           14  31.30474 13        0
    3 15:59:59   CR22X  16.72   101331.88 *           14  29.16264 11        0
    

    The rest of the parsing is straightforward. If your data providers modify the format slightly, you simply need to write some different reformatting scripts.
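
    To give an idea of what that remaining parsing might look like, here is a minimal sketch that dispatches on the record-type prefix. The names %parser_for and parse_line and the hash keys are hypothetical, and the field meanings are not known from the data shown. Note that in one sample line two columns run together ("31.26493219"), so splitting the whole record on whitespace would miscount fields; unpack with the real column widths (not given here) would be the more robust choice for type 3.

```perl
use strict;
use warnings;

# One handler per record type, keyed by the prefix that add_record_types.pl adds.
# The hash keys in the returned records are hypothetical names.
my %parser_for = (
    1 => sub {
        my ($body) = @_;
        my ($label, @fields) = split ' ', $body;
        return { type => 1, label => $label, fields => \@fields };
    },
    2 => sub {
        my ($body) = @_;
        my ($date, $pair) = split ' ', $body;
        return { type => 2, date => $date, values => [ split /,/, $pair ] };
    },
    3 => sub {
        my ($body) = @_;
        # Only the two leading fields are split off; the remainder is kept raw
        # because its fixed-width columns can run together.
        my ($time, $id, $rest) = split ' ', $body, 3;
        return { type => 3, time => $time, id => $id, rest => $rest };
    },
);

sub parse_line {
    my ($line) = @_;
    chomp $line;
    my ($rt, $body) = split ' ', $line, 2;
    return unless defined $rt;    # blank line: nothing to parse
    my $parser = $parser_for{$rt}
        or die "Unknown record type '$rt'\n";
    return $parser->($body // '');
}

# Sample lines taken from the reformatted output above.
my @records = map { parse_line($_) } (
    "1 **DEVICE 109523.69142   .981    561A\n",
    "2 10/MAY/2010    24.15.30,13.45.03\n",
    "3 06:23:51   AW41X  15.67    101323.9 *           14  31.26493219        0\n",
);
printf "%s %s\n", $_->{time}, $_->{id} for grep { $_->{type} == 3 } @records;
```

    If the provider later adds a record type, only a new entry in the dispatch table is needed; parse_line itself stays unchanged.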
