I need to merge two files into a new file.
The two have over 300 million pipe-separated records, with the first column as the primary key. The rows aren't sorted.
Your technique is extremely inefficient for a few reasons.
The first problem can be mitigated by doing the reading and splitting yourself, but holding all of that data in memory is always going to be a problem. The rule of thumb is to avoid pulling big hunks of data into memory. It'll hog all the memory and probably cause it to swap to disk and slow everything waaaay down, especially if you're using a spinning disk.
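For example, doing the reading and splitting yourself, one line at a time, keeps memory use flat no matter how big the files are. A minimal sketch (the filename and what you do with each record are up to you):

use strict;
use warnings;

# Stream the file one line at a time instead of slurping the whole thing.
my $file = $ARGV[0];
open(my $fh, '<', $file) or die "Can't open $file: $!";
while (my $line = <$fh>) {
    chomp($line);
    my ($key, $value) = split(/\|/, $line, 2);   # first column is the key
    # ...do something with $key and $value here...
}
close($fh);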
Instead, there are various "on disk hashes" you can use with modules like GDBM_File or BerkeleyDB.
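For instance, GDBM_File can tie an ordinary hash to a file on disk, so stores and lookups hit the disk rather than RAM. A minimal sketch (the filename here is made up):

use strict;
use warnings;
use GDBM_File;

# Tie %data to an on-disk GDBM file, creating it if it doesn't exist yet.
tie(my %data, 'GDBM_File', 'records.gdbm', &GDBM_WRCREAT, 0640)
    or die "Can't tie records.gdbm: $!";

$data{12345} = 'NITIN';     # stored on disk, not in memory
print "$data{12345}\n";     # read back from disk

untie %data;

It behaves like a normal hash, just slower per access and without the memory blow-up.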
But really there's no reason to mess around with them because we have SQLite and it does everything they do faster and better.
Create a table in SQLite.
create table imported (
    id integer,
    value text
);
Import your file using the sqlite shell's .import command, adjusting for your format with .mode and .separator.
sqlite> create table imported (
...> id integer,
...> value text
...> );
sqlite> .mode list
sqlite> .separator |
sqlite> .import test.data imported
sqlite> .mode column
sqlite> select * from imported;
12345       NITIN
12346       NITINfoo
2398        bar
9823        baz
And now you, and anyone else who has to work with the data, can do whatever you like with it in efficient, flexible SQL. Even if it takes a while to import, you can go do something else while it does.
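And the "whatever you like" includes the merge itself. Here's a minimal sketch of doing it from Perl with DBI and DBD::SQLite, assuming you imported each file into its own table (file1 and file2, say) and gave sqlite3 a database filename such as merge.db rather than keeping it in memory:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connect to the SQLite database the files were imported into.
# "merge.db", "file1" and "file2" are names assumed for this example.
my $dbh = DBI->connect("dbi:SQLite:dbname=merge.db", "", "",
    { RaiseError => 1, AutoCommit => 1 });

# Join the two tables on the key column and emit the merged records.
my $sth = $dbh->prepare(q{
    SELECT file1.id, file1.value, file2.value
    FROM file1
    JOIN file2 ON file2.id = file1.id
});
$sth->execute();

while (my ($id, $v1, $v2) = $sth->fetchrow_array()) {
    print "$id,$v1,$v2\n";
}

$dbh->disconnect();

From here you can add WHERE clauses, indexes, or anything else SQL gives you.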
I'd use sort to sort the data very quickly (5 seconds for 10,000,000 rows), and then merge the sorted files.
perl -e'
    # Read one record: returns (key, value), or an empty list at end of file.
    sub get {
        my $fh = shift;
        my $line = <$fh>;
        return () if !defined($line);
        chomp($line);
        return split(/\|/, $line);
    }

    sub main {
        @ARGV == 2 or die("usage\n");

        # Let sort(1) do the heavy lifting: sort each input numerically on the first field.
        open(my $fh1, "-|", "sort", "-n", "-t", "|", $ARGV[0]) or die $!;
        open(my $fh2, "-|", "sort", "-n", "-t", "|", $ARGV[1]) or die $!;

        my ($key1, $val1) = get($fh1) or return;
        my ($key2, $val2) = get($fh2) or return;

        # Classic merge of two sorted streams: advance whichever side has the
        # smaller key; when the keys match, emit the merged record and advance both.
        while (1) {
            if    ($key1 < $key2) { ($key1, $val1) = get($fh1) or return; }
            elsif ($key1 > $key2) { ($key2, $val2) = get($fh2) or return; }
            else {
                print("$key1,$val1,$val2\n");
                ($key1, $val1) = get($fh1) or return;
                ($key2, $val2) = get($fh2) or return;
            }
        }
    }

    main();
' file1 file2 >file
For 10,000,000 records in each file, this took 37 seconds on a slowish machine.
$ perl -e'printf "%d|%s\n", 10_000_000-$_, "X15X1211,J,S,12,15,100.05" for 1..10_000_000' >file1
$ perl -e'printf "%d|%s\n", 10_000_000-$_, "AJ15,,,16,PP" for 1..10_000_000' >file2
$ time perl -e'...' file1 file2 >file
real 0m37.030s
user 0m38.261s
sys 0m1.750s
Alternatively, one could dump the data into a database and let it handle the details.
sqlite3 <<'EOI'
CREATE TABLE file1 ( id INTEGER, value TEXT );
CREATE TABLE file2 ( id INTEGER, value TEXT );
.mode list
.separator |
.import file1 file1
.import file2 file2
.output file
SELECT file1.id || ',' || file1.value || ',' || file2.value
  FROM file1
  JOIN file2
    ON file2.id = file1.id;
.exit
EOI
But you pay for the flexibility. This took twice as long.
real 1m14.065s
user 1m11.009s
sys 0m2.550s
Note: I originally had CREATE INDEX file2_id ON file2 ( id ); after the .import commands, but removing it greatly helped performance.