I need to merge two files into a new file.
The two have over 300 million pipe-separated records, with the first column as the primary key. The rows aren't sorted.
Your technique is extremely inefficient for a few reasons.
The first problem can be mitigated by doing the reading and splitting yourself, but holding all of that data in memory is always going to be a problem. The rule of thumb is to avoid pulling big hunks of data into memory. It'll hog all the memory and probably cause it to swap to disk and slow everything waaaay down, especially if you're using a spinning disk.
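For example, doing the reading and splitting yourself, one line at a time, keeps memory use flat no matter how big the files are. A minimal sketch (the filename and what you do with each record are up to you):

use strict;
use warnings;

# Stream the file one line at a time instead of slurping the whole thing.
my $file = $ARGV[0];
open(my $fh, '<', $file) or die "Can't open $file: $!";
while (my $line = <$fh>) {
    chomp($line);
    my ($key, $value) = split(/\|/, $line, 2);   # first column is the key
    # ...do something with $key and $value here...
}
close($fh);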
Instead, there are various "on disk hashes" you can use with modules like GDBM_File or BerkeleyDB.
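For instance, GDBM_File can tie an ordinary hash to a file on disk, so stores and lookups hit the disk rather than RAM. A minimal sketch (the filename here is made up):

use strict;
use warnings;
use GDBM_File;

# Tie %data to an on-disk GDBM file, creating it if it doesn't exist yet.
tie(my %data, 'GDBM_File', 'records.gdbm', &GDBM_WRCREAT, 0640)
    or die "Can't tie records.gdbm: $!";

$data{12345} = 'NITIN';     # stored on disk, not in memory
print "$data{12345}\n";     # read back from disk

untie %data;

It behaves like a normal hash, just slower per access and without the memory blow-up.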
But really there's no reason to mess around with them because we have SQLite and it does everything they do faster and better.
Create a table in SQLite.
create table imported (
    id integer,
    value text
);
Import your file using the sqlite shell's .import command, adjusting for your format with .mode and .separator.
sqlite> create table imported (
...> id integer,
...> value text
...> );
sqlite> .mode list
sqlite> .separator |
sqlite> .import test.data imported
sqlite> .mode column
sqlite> select * from imported;
12345       NITIN
12346       NITINfoo
2398        bar
9823        baz
And now you, and anyone else who has to work with the data, can do whatever you like with it in efficient, flexible SQL. Even if it takes a while to import, you can go do something else while it does.
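And the "whatever you like" includes the merge itself. Here's a minimal sketch of doing it from Perl with DBI and DBD::SQLite, assuming you imported each file into its own table (file1 and file2, say) and gave sqlite3 a database filename such as merge.db rather than keeping it in memory:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connect to the SQLite database the files were imported into.
# "merge.db", "file1" and "file2" are names assumed for this example.
my $dbh = DBI->connect("dbi:SQLite:dbname=merge.db", "", "",
    { RaiseError => 1, AutoCommit => 1 });

# Join the two tables on the key column and emit the merged records.
my $sth = $dbh->prepare(q{
    SELECT file1.id, file1.value, file2.value
    FROM file1
    JOIN file2 ON file2.id = file1.id
});
$sth->execute();

while (my ($id, $v1, $v2) = $sth->fetchrow_array()) {
    print "$id,$v1,$v2\n";
}

$dbh->disconnect();

From here you can add WHERE clauses, indexes, or anything else SQL gives you.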
I'd use sort to sort the data very quickly (5 seconds for 10,000,000 rows), and then merge the sorted files.
perl -e'
    # Read one record: returns (key, value), or an empty list at end of file.
    sub get {
        my $fh = shift;
        my $line = <$fh>;
        return () if !defined($line);
        chomp($line);
        return split(/\|/, $line);
    }

    sub main {
        @ARGV == 2 or die("usage\n");

        # Let sort(1) do the heavy lifting: sort each input numerically on the first field.
        open(my $fh1, "-|", "sort", "-n", "-t", "|", $ARGV[0]) or die $!;
        open(my $fh2, "-|", "sort", "-n", "-t", "|", $ARGV[1]) or die $!;

        my ($key1, $val1) = get($fh1) or return;
        my ($key2, $val2) = get($fh2) or return;

        # Classic merge of two sorted streams: advance whichever side has the
        # smaller key; when the keys match, emit the merged record and advance both.
        while (1) {
            if    ($key1 < $key2) { ($key1, $val1) = get($fh1) or return; }
            elsif ($key1 > $key2) { ($key2, $val2) = get($fh2) or return; }
            else {
                print("$key1,$val1,$val2\n");
                ($key1, $val1) = get($fh1) or return;
                ($key2, $val2) = get($fh2) or return;
            }
        }
    }

    main();
' file1 file2 >file
For 10,000,000 records in each file, this took 37 seconds on a slowish machine.
$ perl -e'printf "%d|%s\n", 10_000_000-$_, "X15X1211,J,S,12,15,100.05" for 1..10_000_000' >file1
$ perl -e'printf "%d|%s\n", 10_000_000-$_, "AJ15,,,16,PP" for 1..10_000_000' >file2
$ time perl -e'...' file1 file2 >file
real 0m37.030s
user 0m38.261s
sys 0m1.750s
Alternatively, one could dump the data into a database and let it handle the details.
sqlite3 <<'EOI'
CREATE TABLE file1 ( id INTEGER, value TEXT );
CREATE TABLE file2 ( id INTEGER, value TEXT );
.mode list
.separator |
.import file1 file1
.import file2 file2
.output file
SELECT file1.id || ',' || file1.value || ',' || file2.value
  FROM file1
  JOIN file2
    ON file2.id = file1.id;
.exit
EOI
But you pay for the flexibility. This took twice as long.
real 1m14.065s
user 1m11.009s
sys 0m2.550s
Note: I originally had CREATE INDEX file2_id ON file2 ( id ); after the .import commands, but removing it greatly helped performance.