Very huge associative array in Perl


再見小時候 2020-12-17 07:26

I need to merge two files into a new file.

The two files have over 300 million pipe-separated records, with the first column as the primary key. The rows aren't sorted. The seco

2 answers

暖寄归人 (original poster)

2020-12-17 08:05

I'd use sort to sort the data very quickly (5 seconds for 10,000,000 rows), and then merge the sorted files.

perl -e'
   sub get {
      my $fh = shift;
      my $line = <$fh>;
      return () if !defined($line);

      chomp($line);
      return split(/\|/, $line);
   }

   sub main {
      @ARGV == 2
         or die("usage\n");

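      # Let sort(1) order each file numerically by its first pipe-separated field.
      open(my $fh1, "-|", "sort", "-n", "-t", "|", "-k1,1", $ARGV[0])
         or die("Cannot run sort on $ARGV[0]: $!\n");
      open(my $fh2, "-|", "sort", "-n", "-t", "|", "-k1,1", $ARGV[1])
         or die("Cannot run sort on $ARGV[1]: $!\n");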

      my ($key1, $val1) = get($fh1)  or return;
      my ($key2, $val2) = get($fh2)  or return;

      while (1) {
         if    ($key1 < $key2) { ($key1, $val1) = get($fh1)  or return; }
         elsif ($key1 > $key2) { ($key2, $val2) = get($fh2)  or return; }
         else {
            print("$key1,$val1,$val2\n");
            ($key1, $val1) = get($fh1)  or return;
            ($key2, $val2) = get($fh2)  or return;
         }
      }
   }

   main();
' file1 file2 >file

For 10,000,000 records in each file, this took 37 seconds on a slowish machine.

$ perl -e'printf "%d|%s\n", 10_000_000-$_, "X15X1211,J,S,12,15,100.05" for 1..10_000_000' >file1

$ perl -e'printf "%d|%s\n", 10_000_000-$_, "AJ15,,,16,PP" for 1..10_000_000' >file2

$ time perl -e'...' file1 file2 >file
real    0m37.030s
user    0m38.261s
sys     0m1.750s
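The same sort-then-merge idea can also be sketched with the standard sort and join utilities alone. This is not what I timed above, just a minimal illustration on tiny made-up inputs; the temporary directory and the *.sorted and merged file names are assumptions for the example. Note that join expects its inputs in lexical order on the join field, not numeric order, so the sort here drops -n and pins the collation with LC_ALL=C.

```shell
# join needs both inputs sorted in the same collation it compares with.
export LC_ALL=C

# Hypothetical sample inputs standing in for file1 and file2.
tmp=$(mktemp -d)
printf '%s\n' '3|c' '1|a' '2|b' >"$tmp/file1"
printf '%s\n' '2|y' '4|z' '1|x' >"$tmp/file2"

# Sort each file lexically on the first pipe-separated field.
sort -t '|' -k1,1 "$tmp/file1" >"$tmp/file1.sorted"
sort -t '|' -k1,1 "$tmp/file2" >"$tmp/file2.sorted"

# Inner join on field 1: key, then the rest of file1, then the rest of file2.
join -t '|' "$tmp/file1.sorted" "$tmp/file2.sorted" >"$tmp/merged"
cat "$tmp/merged"
```

Keys that appear in only one file (3 and 4 here) are dropped, which matches what the Perl loop does. The output stays pipe-separated rather than comma-separated.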

Alternatively, one could dump the data into a database and let it handle the details.

sqlite3 <<'EOI'
CREATE TABLE file1 ( id INTEGER, value TEXT );
CREATE TABLE file2 ( id INTEGER, value TEXT );
.mode list
.separator |
.import file1 file1
.import file2 file2
.output file
SELECT file1.id || ',' || file1.value || ',' || file2.value
  FROM file1
  JOIN file2
    ON file2.id = file1.id;
.exit
EOI

But you pay for the flexibility. This took twice as long.

real    1m14.065s
user    1m11.009s
sys     0m2.550s

Note: I originally had CREATE INDEX file2_id ON file2 ( id ); after the .import commands, but removing it greatly improved performance.
