Parsing unsorted data from large fixed width text

时光取名叫无心 2021-01-19 07:09

I am mostly a Matlab user and a Perl n00b. This is my first Perl script.

I have a large fixed width data file that I would like to process into a binary file with a

4 Answers
  •  春和景丽
    2021-01-19 07:28

    I modified my code to build a hash as suggested. I have not incorporated the binary output yet due to time limitations. I also still need to figure out how to reference the hash to get the data out and pack it into binary. I don't think that part should be too difficult ... hopefully.

    On an actual data file (~350 MB and 2.0 million lines) the following code takes approximately 3 minutes to build the hash. CPU usage was 100% on one of my cores (nil on the other three), and Perl's memory usage topped out at around 325 MB ... until it dumped millions of lines to the prompt. However, the print Dumper call will be replaced with a binary pack.

    Please let me know if I am making any rookie mistakes.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use Data::Dumper;
    
    my $lineArg1 = $ARGV[0];
    open(INFILE, '<', $lineArg1) or die "Cannot open $lineArg1: $!"; # 3-arg open, fail loudly
    
    my $line;
    my @param_names;
    my @template;
    while ($line = <INFILE>) {
        chomp $line; #Remove New Line
        if ($line =~ s/\s+filter = ALL_VALUES//) { #Find parameters and build a list
           push @param_names, trim($line);
        }
        elsif ($line =~ /^----/) {
            @template = map {'A'.length} $line =~ /(\S+\s*)/g; #Make template for unpack
            $template[-1] = 'A*';
            my $data_start_pos = tell INFILE;
            last; #Reached start of data exit loop
        }
    }
    
    my $size = $#param_names+1;
    my @getType = ((1) x $size);
    my $template = "@template";
    my @lineData;
    my %dataHash;
    my $lineCount = 0;
    while ($line = <INFILE>) {
        if ($lineCount % 100000 == 0){
            print "On Line: ".$lineCount."\n";
        }
        if ($line =~ /^\d/) { 
            chomp($line);
            @lineData = unpack $template, $line;
            my ($inHeader, $headerIndex) = findStr($lineData[1], @param_names);
            if ($inHeader) { 
                push @{$dataHash{$lineData[1]}{time} }, $lineData[0];
                push @{$dataHash{$lineData[1]}{data} }, $lineData[3];
                if ($getType[$headerIndex]){ # Things that only need written once
                    $dataHash{$lineData[1]}{type}  = $lineData[2];
                    $getType[$headerIndex] = 0;
                }
            }
        }  
        $lineCount++;
    } # END WHILE 
    close(INFILE);
    
    print Dumper \%dataHash;
    
    #WRITE BINARY FILE and TOC FILE
    my %convert = (TXT=>sub{pack 'A*', join "\n", @_}, D=>sub{pack 'd*', @_}, UI=>sub{pack 'L*', @_});
    
    open my $binfile, '>:raw', $lineArg1.'.bin' or die "Cannot write $lineArg1.bin: $!";
    open my $tocfile, '>', $lineArg1.'.toc' or die "Cannot write $lineArg1.toc: $!";
    
    for my $param (@param_names){
        my $data = $dataHash{$param};
        my @toc_line = ($param, $data->{type}, tell $binfile );
        print {$binfile} $convert{D}->(@{$data->{time}});
        push @toc_line, tell $binfile;
        print {$binfile} $convert{$data->{type}}->(@{$data->{data}});
        push @toc_line, tell $binfile;
        print {$tocfile} join(',',@toc_line,''),"\n";
    }
    
    sub trim { #Trim leading and trailing white space
      my (@strings) = @_;
      foreach my $string (@strings) {
        $string =~ s/^\s+//;
        $string =~ s/\s+$//;
        chomp ($string);
      } 
      return wantarray ? @strings : $strings[0];
    } # END SUB
    
    sub findStr { #Return TRUE if string is contained in array.
        my $searchStr = shift;
        my $i = 0;
        foreach ( @_ ) {
            if ($_ eq $searchStr){
                return (1,$i);
            }
            $i++;
        }
        return (0,-1);
    } # END SUB
    

    The output is as follows:

    $VAR1 = {
              'Param 1' => {
                             'time' => [
                                         '1.1',
                                         '3.2',
                                         '5.3'
                                       ],
                             'type' => 'UI',
                             'data' => [
                                         '5',
                                         '10',
                                         '15'
                                       ]
                           },
              'Param 2' => {
                             'time' => [
                                         '4.5',
                                         '6.121'
                                       ],
                             'type' => 'D',
                             'data' => [
                                         '2.1234',
                                         '3.1234'
                                       ]
                           },
              'Param 3' => {
                             'time' => [
                                         '2.23',
                                         '7.56'
                                       ],
                             'type' => 'TXT',
                             'data' => [
                                         'Some Text 1',
                                         'Some Text 2'
                                       ]
                           }
            };
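
For anyone else unsure how to reference a nested hash like the one dumped above, here is a minimal sketch of getting values back out and packing them. The sample values are copied from the 'Param 1' entry; the variable names `$entry`, `@times`, `$packed_times` and `$packed_data` are only illustrative and do not appear in the script.

```perl
use strict;
use warnings;

# Sample data copied from the 'Param 1' entry of the Dumper output.
my %dataHash = (
    'Param 1' => {
        time => ['1.1', '3.2', '5.3'],
        type => 'UI',
        data => ['5', '10', '15'],
    },
);

my $entry = $dataHash{'Param 1'};    # hash reference for one parameter
my $type  = $entry->{type};          # plain scalar stored under 'type'
my @times = @{ $entry->{time} };     # dereference the array ref into a list
my $first = $entry->{data}[0];       # a single element of the data array

# Pack times as native doubles and data as unsigned 32-bit integers,
# the same templates the %convert table uses for D and UI.
my $packed_times = pack 'd*', @times;
my $packed_data  = pack 'L*', @{ $entry->{data} };

printf "%s: %d time bytes, %d data bytes\n",
    $type, length($packed_times), length($packed_data);
```

On a platform with 8-byte doubles, the two lengths correspond to the gaps between the offsets on the 'Param 1' line of the TOC file.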
    

    Here is the output TOC File:

    Param 1,UI,0,24,36,
    Param 2,D,36,52,68,
    Param 3,TXT,68,84,107,
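
Each TOC line holds the parameter name, its type, and three byte offsets: where the time values start, where the data values start, and where the data ends. A rough sketch of reading one parameter back with those offsets follows. It rebuilds the 'Param 1' bytes in memory so it can run on its own; `read_span` and `%unpack_for` are hypothetical names, not part of the script, and a real reader would open the .bin file with '<:raw' instead.

```perl
use strict;
use warnings;

# Rebuild the 'Param 1' bytes in memory so the sketch is self-contained.
my $bin = pack('d*', 1.1, 3.2, 5.3) . pack('L*', 5, 10, 15);
open my $binfile, '<', \$bin or die "Cannot open in-memory buffer: $!";

# One TOC line: name, type, time start, data start, data end.
# split drops the empty field left by the trailing comma.
my $toc_line = 'Param 1,UI,0,24,36,';
my ($name, $type, $t_start, $d_start, $d_end) = split /,/, $toc_line;

# Unpack templates mirroring the %convert table used for writing.
my %unpack_for = (D => 'd*', UI => 'L*', TXT => 'A*');

# Hypothetical helper: return the bytes in [$from, $to) of the handle.
sub read_span {
    my ($fh, $from, $to) = @_;
    seek($fh, $from, 0) or die "seek failed: $!";
    my $n = read($fh, my $buf, $to - $from);
    die "short read" unless defined $n && $n == $to - $from;
    return $buf;
}

my @time = unpack 'd*', read_span($binfile, $t_start, $d_start);
my @data = unpack $unpack_for{$type}, read_span($binfile, $d_start, $d_end);
# TXT data was joined with "\n" on writing, so it would need split /\n/ here.

print "$name ($type): @time => @data\n";
```

Note that 'd' and 'L' use the machine's native byte order, so this only round-trips when the reader runs on the same kind of machine as the writer.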
    

    Thanks everyone for their help so far! This is an excellent resource!

    EDIT: Added Binary & TOC file writing code.
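
A note on findStr in the script above: it scans @param_names from the start for every data line, which may add up over 2 million lines. One possible alternative, assuming the parameter names are unique, is to build a name-to-index hash once and look names up directly. This is only a sketch; `%param_index` and `find_param` are hypothetical names, and I have not timed it against the real file.

```perl
use strict;
use warnings;

# Hypothetical names standing in for the list parsed from the file header.
my @param_names = ('Param 1', 'Param 2', 'Param 3');

# Build the name => index map once, before the main read loop.
my %param_index;
@param_index{@param_names} = (0 .. $#param_names);

# Same return convention as findStr: (found flag, index or -1).
sub find_param {
    my ($name) = @_;
    return exists $param_index{$name}
        ? (1, $param_index{$name})
        : (0, -1);
}

my ($inHeader, $headerIndex) = find_param('Param 2');
print "found=$inHeader index=$headerIndex\n";
```

The call site in the read loop would stay the same apart from the function name, since the return values have the same shape.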
