Very huge associative array in Perl

再見小時候 2020-12-17 07:26

I need to merge two files into a new file.

The two files have over 300 million pipe-separated records, with the first column as the primary key. The rows aren't sorted. The seco

2 answers
  •  暖寄归人
    2020-12-17 08:05

    I'd use the sort utility to sort the data, which is very quick (5 seconds for 10,000,000 rows), and then merge the sorted files.

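    perl -e'
       # Read one record; return (key, value), or an empty list at EOF.
       sub get {
          my $fh = shift;
          my $line = <$fh>;
          return () if !defined($line);

          chomp($line);
          # A limit of 2 keeps any further pipes inside the value.
          return split(/\|/, $line, 2);
       }

       sub main {
          @ARGV == 2
             or die("usage\n");

          # Let sort(1) order each file numerically by its leading key.
          open(my $fh1, "-|", "sort", "-n", "-t", "|", $ARGV[0])
             or die("Cannot run sort: $!\n");
          open(my $fh2, "-|", "sort", "-n", "-t", "|", $ARGV[1])
             or die("Cannot run sort: $!\n");

          my ($key1, $val1) = get($fh1)  or return;
          my ($key2, $val2) = get($fh2)  or return;

          # Advance whichever side has the smaller key; print when the keys match.
          # The loop stops as soon as either input is exhausted.
          while (1) {
             if    ($key1 < $key2) { ($key1, $val1) = get($fh1)  or return; }
             elsif ($key1 > $key2) { ($key2, $val2) = get($fh2)  or return; }
             else {
                print("$key1,$val1,$val2\n");
                ($key1, $val1) = get($fh1)  or return;
                ($key2, $val2) = get($fh2)  or return;
             }
          }
       }

       main();
    ' file1 file2 >file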
    

    For 10,000,000 records in each file, this took 37 seconds on a slowish machine.

    $ perl -e'printf "%d|%s\n", 10_000_000-$_, "X15X1211,J,S,12,15,100.05" for 1..10_000_000' >file1
    
    $ perl -e'printf "%d|%s\n", 10_000_000-$_, "AJ15,,,16,PP" for 1..10_000_000' >file2
    
    $ time perl -e'...' file1 file2 >file
    real    0m37.030s
    user    0m38.261s
    sys     0m1.750s
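
For comparison, the same sort-then-merge idea can be sketched with the standard `sort` and `join` utilities in place of the Perl merge loop. This is not part of the answer above and was not timed; `merge_by_key` is a hypothetical helper name. It sorts keys as text, because `join` expects lexically ordered input, so the output order differs from `sort -n`, and it writes fields separated by `|` rather than `,`.

```shell
# Hypothetical helper: inner-join two pipe-separated files on their first column.
merge_by_key() {
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || return 1

    # join(1) needs both inputs sorted on the join field in the same
    # collation, so sort the key as text under the C locale (no -n here).
    LC_ALL=C sort -t '|' -k1,1 "$1" >"$tmp1"
    LC_ALL=C sort -t '|' -k1,1 "$2" >"$tmp2"

    # Print key|fields-from-file1|fields-from-file2 for keys found in both files.
    LC_ALL=C join -t '|' "$tmp1" "$tmp2"

    rm -f "$tmp1" "$tmp2"
}
```

Calling `merge_by_key file1 file2 >file` prints only the keys present in both inputs, which matches what the Perl loop does when it stops at the end of either file.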
    

    Alternatively, one could load the data into a database and let it handle the details.

    sqlite3 <<'EOI'
    CREATE TABLE file1 ( id INTEGER, value TEXT );
    CREATE TABLE file2 ( id INTEGER, value TEXT );
    .mode list
    .separator |
    .import file1 file1
    .import file2 file2
    .output file
    SELECT file1.id || ',' || file1.value || ',' || file2.value
      FROM file1
      JOIN file2
        ON file2.id = file1.id;
    .exit
    EOI
    

    But you pay for the flexibility. This took twice as long.

    real    1m14.065s
    user    1m11.009s
    sys     0m2.550s
    

    Note: I originally had CREATE INDEX file2_id ON file2 ( id ); after the .import commands, but removing it greatly helped performance.
