I need to merge two files into a new file.
The two have over 300 million pipe-separated records, with the first column as the primary key. The rows aren't sorted. The seco
I'd use sort, which sorts the data very quickly (about 5 seconds for 10,000,000 rows), and then merge the sorted files.
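For reference, the sort step on its own would look something like this (a sketch; -k1,1n makes the numeric sort on the first pipe-delimited field explicit, and the .sorted filenames are just placeholders; the script below streams directly from sort instead of writing intermediate files):
$ sort -t'|' -k1,1n file1 >file1.sorted
$ sort -t'|' -k1,1n file2 >file2.sorted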
perl -e'
use strict;
use warnings;

# Read one record; returns (key, rest-of-line) or an empty list at EOF.
sub get {
    my $fh = shift;
    my $line = <$fh>;
    return () if !defined($line);
    chomp($line);
    return split(/\|/, $line, 2);  # Limit of 2 keeps embedded pipes in the value intact.
}

sub main {
    @ARGV == 2
        or die("usage\n");

    # Sort each input numerically on its leading key and read the sorted stream.
    open(my $fh1, "-|", "sort", "-n", "-t", "|", $ARGV[0]) or die($!);
    open(my $fh2, "-|", "sort", "-n", "-t", "|", $ARGV[1]) or die($!);

    my ($key1, $val1) = get($fh1) or return;
    my ($key2, $val2) = get($fh2) or return;
    while (1) {
        # Advance whichever stream has the smaller key; on a match, emit and advance both.
        if    ($key1 < $key2) { ($key1, $val1) = get($fh1) or return; }
        elsif ($key1 > $key2) { ($key2, $val2) = get($fh2) or return; }
        else {
            print("$key1,$val1,$val2\n");
            ($key1, $val1) = get($fh1) or return;
            ($key2, $val2) = get($fh2) or return;
        }
    }
}

main();
' file1 file2 >file
For 10,000,000 records in each file, this took 37 seconds on a slowish machine.
$ perl -e'printf "%d|%s\n", 10_000_000-$_, "X15X1211,J,S,12,15,100.05" for 1..10_000_000' >file1
$ perl -e'printf "%d|%s\n", 10_000_000-$_, "AJ15,,,16,PP" for 1..10_000_000' >file2
$ time perl -e'...' file1 file2 >file
real 0m37.030s
user 0m38.261s
sys 0m1.750s
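As an aside, the standard join utility can do the same merge once the inputs are sorted. A sketch, not benchmarked here: join compares keys as strings, so the inputs must be sorted lexicographically (no -n), and its output stays pipe-separated rather than comma-separated as above (the .lex filenames are just placeholders):
$ sort -t'|' -k1,1 file1 >file1.lex
$ sort -t'|' -k1,1 file2 >file2.lex
$ join -t'|' file1.lex file2.lex >file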
Alternatively, one could dump the data into a database and let it handle the details.
sqlite3 <<'EOI'
CREATE TABLE file1 ( id INTEGER, value TEXT );
CREATE TABLE file2 ( id INTEGER, value TEXT );
.mode list
.separator |
.import file1 file1
.import file2 file2
.output file
SELECT file1.id || "," || file1.value || "," || file2.value
FROM file1
JOIN file2
ON file2.id = file1.id;
.exit
EOI
But you pay for the flexibility. This took twice as long.
real 1m14.065s
user 1m11.009s
sys 0m2.550s
Note: I originally had CREATE INDEX file2_id ON file2 ( id ); after the .import commands, but removing it greatly improved performance.
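One plausible explanation: when no index exists, SQLite builds a transient automatic index for the join on its own, so creating one up front just duplicates that work. EXPLAIN QUERY PLAN shows the decision (a sketch; the exact wording varies across SQLite versions):
sqlite3 <<'EOI'
CREATE TABLE file1 ( id INTEGER, value TEXT );
CREATE TABLE file2 ( id INTEGER, value TEXT );
EXPLAIN QUERY PLAN
SELECT file1.id || ',' || file1.value || ',' || file2.value
  FROM file1
  JOIN file2
    ON file2.id = file1.id;
.exit
EOI
On recent versions this reports something like SEARCH file2 USING AUTOMATIC COVERING INDEX (id=?).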