This question is not Perl-specific, (although the unpack function will most probably figure into my implementation).
I have to deal with files where multipl
This is a good question. Two suggestions occur to me.
(1) The first is simply to reiterate the idea from cjm: an object-based state machine. This is a flexible way to perform complex parsing. I've used its several times and have been happy with the results in most cases.
(2) The second idea falls under the category of a divide-and-conquer Unix-pipeline to pre-process the data.
First an observation about your data: if a set of formats always occurs as a pair, it effectively represent a single data format and can be combined without any loss of information. This means that you have only 3 formats: 1+2, 3, and 4+5.
And that thought leads to the strategy. Write a very simple script or two to pre-process your data -- effectively, a reformatting step to get the data into shape before the real parsing work begins. Here I show the scripts as separate tools. They could be combined, but the general philosophy might suggest that they remain distinct, narrowly defined tools.
Omitting the she-bang and use strict/warnings.
while (<>){
chomp;
print /^\*?\s/ ? ' ' : "\n", $_;
}
print "\n";
while (<>){
next unless /\S/;
my $rt = /^\*/ ? 1 :
/^..\// ? 2 : 3;
print $rt, ' ', $_;
}
./unbreak_records.pl orig.dat | ./add_record_types.pl > reformatted.dat
1 **DEVICE 109523.69142 .981 561A
2 10/MAY/2010 24.15.30,13.45.03
3 05:03:01 AB23X 15.67 101325.72 * 14 31.30474 13 0
3 05:03:15 CR22X 16.72 101325.42 * 14 29.16264 11 0
3 06:23:51 AW41X 15.67 101323.9 * 14 31.26493219 0
2 11/MAY/2010 24.07.13,13.44.63
3 15:57:14 AB23X 15.67 101327.23 * 14 31.30474 13 0
3 15:59:59 CR22X 16.72 101331.88 * 14 29.16264 11 0
The rest of the parsing is straightforward. If your data providers modify the format slightly, you simply need to write some different reformatting scripts.