Strategies to handle a file with multiple fixed formats

北海茫月 2021-01-12 22:40

This question is not Perl-specific (although the unpack function will most probably figure in my implementation).

I have to deal with files where multipl

6 answers

感动是毒 (original poster)

2021-01-12 23:15
This is a good question. Two suggestions occur to me.

(1) The first is simply to reiterate the idea from cjm: an object-based state machine. This is a flexible way to perform complex parsing. I've used it several times and have been happy with the results in most cases.
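
To make suggestion (1) a little more concrete, here is a minimal sketch of what such an object might look like. It is not cjm's code. The package name RecordMachine, the state names, and the transition table are all assumptions, and the raw sample lines are reconstructed from the sample output further down in this answer, so the exact points where records are broken across lines are a guess. The line classification reuses the same patterns as the scripts below.

```perl
use strict;
use warnings;

package RecordMachine;    # hypothetical name, for illustration only

# Legal transitions: current state => { line class => next state }.
# This grammar is an assumption inferred from the sample data.
my %next_state = (
    start   => { header => 'header' },
    header  => { continuation => 'header', date => 'date' },
    date    => { reading => 'reading', date => 'date' },
    reading => { continuation => 'reading', reading => 'reading',
                 date => 'date', header => 'header' },
);

sub new {
    my ($class) = @_;
    return bless { state => 'start', records => [] }, $class;
}

# Classify a raw line with the same patterns the scripts below use.
sub classify {
    my ($self, $line) = @_;
    return 'continuation' if $line =~ /^\*?\s/;
    return 'header'       if $line =~ /^\*/;
    return 'date'         if $line =~ m{^../};
    return 'reading';
}

# Consume one raw line: check the transition, then extend or start a record.
sub feed {
    my ($self, $line) = @_;
    chomp $line;
    return unless $line =~ /\S/;    # ignore blank lines
    my $kind = $self->classify($line);
    my $next = $next_state{ $self->{state} }{$kind}
        or die "Unexpected $kind line in state $self->{state}: $line\n";
    if ($kind eq 'continuation') {
        $self->{records}[-1]{text} .= " $line";
    }
    else {
        push @{ $self->{records} }, { kind => $kind, text => $line };
    }
    $self->{state} = $next;
    return;
}

sub records { return @{ $_[0]{records} } }

package main;

# Raw lines reconstructed from the sample output (break points assumed).
my $m = RecordMachine->new;
$m->feed($_) for (
    '**DEVICE 109523.69142',
    '  .981    561A',
    '10/MAY/2010    24.15.30,13.45.03',
    '05:03:01   AB23X  15.67   101325.72',
    '*           14  31.30474 13        0',
);
print "$_->{kind}: $_->{text}\n" for $m->records;
```

The point of the transition table is that an out-of-place line (say, a reading before any date line) fails loudly instead of being silently mis-parsed.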

(2) The second idea is a divide-and-conquer approach: a Unix pipeline that pre-processes the data.

First, an observation about your data: if a set of formats always occurs as a pair, it effectively represents a single data format, and the pair can be combined without any loss of information. This means that you have only 3 formats: 1+2, 3, and 4+5.

And that thought leads to the strategy. Write a very simple script or two to pre-process your data -- effectively, a reformatting step to get the data into shape before the real parsing work begins. Here I show the scripts as separate tools. They could be combined, but the general philosophy might suggest that they remain distinct, narrowly defined tools.

In unbreak_records.pl (omitting the shebang line and use strict/warnings):
```
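# Join each continuation line (one starting with whitespace, or with "*"
# followed by whitespace) onto the record that precedes it.
while (<>){
    chomp;
    print /^\*?\s/ ? ' ' : "\n", $_;
}
print "\n";    # terminate the final record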
```
In add_record_types.pl
```
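while (<>){
    next unless /\S/;    # skip the blank first line left by unbreak_records.pl
    # 1 = line starting with "*", 2 = line with "/" as third character, 3 = anything else
    my $rt = /^\*/ ?   1 :
             /^..\// ? 2 : 3;
    print $rt, ' ', $_;
}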
```
At the command line.
```
./unbreak_records.pl orig.dat | ./add_record_types.pl > reformatted.dat
```
Output:
```
1 **DEVICE 109523.69142   .981    561A
2 10/MAY/2010    24.15.30,13.45.03
3 05:03:01   AB23X  15.67   101325.72 *           14  31.30474 13        0
3 05:03:15   CR22X  16.72   101325.42 *           14  29.16264 11        0
3 06:23:51   AW41X  15.67    101323.9 *           14  31.26493219        0
2 11/MAY/2010    24.07.13,13.44.63
3 15:57:14   AB23X  15.67   101327.23 *           14  31.30474 13        0
3 15:59:59   CR22X  16.72   101331.88 *           14  29.16264 11        0
```
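If a single tool is preferred over two, the same two steps can be folded into one pass. The following is only a sketch under that assumption: the function name reformat_records is made up for illustration, and the raw input lines in the example are reconstructed from the output above rather than taken from the real file.

```perl
use strict;
use warnings;

# Hypothetical one-pass variant of the two scripts above:
# join continuation lines, then tag each rebuilt record with its type.
sub reformat_records {
    my @lines = @_;
    my @records;
    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /\S/;              # skip blank lines
        if (@records && $line =~ /^\*?\s/) {
            $records[-1] .= " $line";           # continuation of previous record
        }
        else {
            push @records, $line;               # start of a new record
        }
    }
    return map {
        my $rt = /^\*/ ? 1 : m{^../} ? 2 : 3;   # same type tests as add_record_types.pl
        "$rt $_";
    } @records;
}

# Raw lines reconstructed from the output above (break points assumed).
print "$_\n" for reformat_records(
    '**DEVICE 109523.69142',
    '  .981    561A',
    '10/MAY/2010    24.15.30,13.45.03',
);
```

Keeping the tools separate, as shown earlier, still has the advantage that each step can be inspected on its own.
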
The rest of the parsing is straightforward. If your data providers modify the format slightly, you simply need to write some different reformatting scripts.
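
As a rough illustration of that remaining step, the sketch below dispatches each reformatted line to a handler chosen by its leading record-type tag. The handler names and the field names (name, values, date, time, unit, rest) are guesses made for illustration; the real field meanings are not given in this thread. Only the leading fields are split on whitespace, because later columns can run together (as in 31.26493219 above), so those would need unpack with the real column widths.

```perl
use strict;
use warnings;

# One handler per record type; all field names here are hypothetical.
sub parse_device {
    my ($payload) = @_;
    my ($name, @values) = split ' ', $payload;
    return { type => 1, name => $name, values => \@values };
}

sub parse_date {
    my ($payload) = @_;
    my ($date, $rest) = split ' ', $payload, 2;
    return { type => 2, date => $date, rest => $rest };
}

sub parse_reading {
    my ($payload) = @_;
    # Later fixed-width columns may touch, so keep them unsplit in "rest".
    my ($time, $unit, $rest) = split ' ', $payload, 3;
    return { type => 3, time => $time, unit => $unit, rest => $rest };
}

my %handler = (1 => \&parse_device, 2 => \&parse_date, 3 => \&parse_reading);

# Strip the type tag added by add_record_types.pl and dispatch on it.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ /\S/;
    my ($type, $payload) = split ' ', $line, 2;
    my $h = $handler{$type} or die "Unknown record type '$type'\n";
    return $h->($payload);
}

# Example using three lines from the output above.
for my $line (
    '1 **DEVICE 109523.69142   .981    561A',
    '2 10/MAY/2010    24.15.30,13.45.03',
    '3 05:03:01   AB23X  15.67   101325.72 *           14  31.30474 13        0',
) {
    my $rec = parse_line($line);
    printf "type %d: %s\n", $rec->{type}, $rec->{name} // $rec->{date} // $rec->{time};
}
```

In real use the loop would read reformatted.dat line by line instead of a hard-coded list.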