very huge associative array in perl

再見小時候 2020-12-17 07:26

I need to merge two files into a new file.

The two have over 300 million pipe-separated records, with the first column as the primary key. The rows aren't sorted. The seco

2 Answers
  • 2020-12-17 07:58

    Your technique is extremely inefficient for a few reasons.

    • Tying is extremely slow.
    • You're pulling everything into memory.

    The first can be mitigated by doing the reading and splitting yourself, but the second is always going to be a problem. The rule of thumb is to avoid pulling big hunks of data into memory. It'll hog all the memory and probably cause the machine to swap to disk and slow waaaay down, especially if you're using a spinning disk.

    Instead, there are various "on disk hashes" you can use with modules like GDBM_File or BerkeleyDB.
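
    For the record, the tie interface for one of those looks roughly like this. It's only a minimal sketch, assuming GDBM_File is available; the file names merge.gdbm and file2 are placeholders, not anything from the question:

    use strict;
    use warnings;
    use GDBM_File;

    # On-disk hash: keys and values live in merge.gdbm, not in RAM.
    tie my %lookup, "GDBM_File", "merge.gdbm", GDBM_WRCREAT, 0640
        or die "cannot tie merge.gdbm: $!";

    # Load one file into the on-disk hash, a line at a time.
    open my $fh, "<", "file2" or die "cannot open file2: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split /\|/, $line, 2;
        $lookup{$key} = $value;    # written to disk, not kept in memory
    }
    close $fh;

    untie %lookup;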

    But really there's no reason to mess around with them because we have SQLite and it does everything they do faster and better.


    Create a table in SQLite.

    create table imported (
        id integer,
        value text
    );
    

    Import your file using the sqlite shell's .import command, adjusting for your format with .mode and .separator.

    sqlite>     create table imported (
       ...>         id integer,
       ...>         value text
       ...>     );
    sqlite> .mode list
    sqlite> .separator |
    sqlite> .import test.data imported
    sqlite> .mode column
    sqlite> select * from imported;
    12345       NITIN     
    12346       NITINfoo  
    2398        bar       
    9823        baz     
    

    And now you, and anyone else who has to work with the data, can do whatever you like with it in efficient, flexible SQL. Even if it takes a while to import, you can go do something else while it does.
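
    If you would rather stay in Perl, the imported table is one DBI query away. This is just a rough sketch, assuming DBD::SQLite is installed and the shell session above was saved to a database file named test.db (the file name and the key 12345 are only examples):

    use strict;
    use warnings;
    use DBI;

    # Open the SQLite database created by the .import step.
    my $dbh = DBI->connect("dbi:SQLite:dbname=test.db", "", "",
                           { RaiseError => 1 });

    # Fetch a single record by key; nothing else gets pulled into memory.
    my $sth = $dbh->prepare("SELECT value FROM imported WHERE id = ?");
    $sth->execute(12345);
    while (my ($value) = $sth->fetchrow_array) {
        print "$value\n";
    }

    $dbh->disconnect;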

  • 2020-12-17 08:05

    I'd use sort to sort the data very quickly (5 seconds for 10,000,000 rows), and then merge the sorted files.

    perl -e'
       # Return the next (key, value) pair from a handle, or () at EOF.
       sub get {
          my $fh = shift;
          my $line = <$fh>;
          return () if !defined($line);
    
          chomp($line);
          return split(/\|/, $line);
       }
    
       sub main {
          @ARGV == 2
             or die("usage\n");
    
          # Let the system sort order each file numerically by its leading key.
          open(my $fh1, "-|", "sort", "-n", "-t", "|", $ARGV[0]);
          open(my $fh2, "-|", "sort", "-n", "-t", "|", $ARGV[1]);
    
          my ($key1, $val1) = get($fh1)  or return;
          my ($key2, $val2) = get($fh2)  or return;
    
          # Merge-join the two sorted streams on the key.
          while (1) {
             if    ($key1 < $key2) { ($key1, $val1) = get($fh1)  or return; }
             elsif ($key1 > $key2) { ($key2, $val2) = get($fh2)  or return; }
             else {
                print("$key1,$val1,$val2\n");
                ($key1, $val1) = get($fh1)  or return;
                ($key2, $val2) = get($fh2)  or return;
             }
          }
       }
    
       main();
    ' file1 file2 >file
    

    For 10,000,000 records in each file, this took 37 seconds on a slowish machine.

    $ perl -e'printf "%d|%s\n", 10_000_000-$_, "X15X1211,J,S,12,15,100.05" for 1..10_000_000' >file1
    
    $ perl -e'printf "%d|%s\n", 10_000_000-$_, "AJ15,,,16,PP" for 1..10_000_000' >file2
    
    $ time perl -e'...' file1 file2 >file
    real    0m37.030s
    user    0m38.261s
    sys     0m1.750s
    

    Alternatively, one could dump the data into a database and let it handle the details.

    sqlite3 <<'EOI'
    CREATE TABLE file1 ( id INTEGER, value TEXT );
    CREATE TABLE file2 ( id INTEGER, value TEXT );
    .mode list
    .separator |
    .import file1 file1
    .import file2 file2
    .output file
    SELECT file1.id || ',' || file1.value || ',' || file2.value
      FROM file1
      JOIN file2
        ON file2.id = file1.id;
    .exit
    EOI
    

    But you pay for the flexibility. This took twice as long.

    real    1m14.065s
    user    1m11.009s
    sys     0m2.550s
    

    Note: I originally had CREATE INDEX file2_id ON file2 ( id ); after the .import commands, but removing it greatly helped performance.
