file.contain.query.txt
ENST001
ENST002
ENST003
file.to.search.in.txt
ENST001 90
ENST002 80
ENST004 50
This Perl code may helps you:
use strict;
open my $file1, "<", "file.contain.query.txt" or die $!;
open my $file2, "<", "file.to.search.in.txt" or die $!;
my %KEYS = ();
# Hash %KEYS marks the filtered keys by "file.contain.query.txt" file
while(my $line=<$file1>) {
chomp $line;
$KEYS{$line} = 1;
}
while(my $line=<$file2>) {
if( $line =~ /(\w+)\s+(\d+)/ ) {
print "$1 $2\n" if $KEYS{$1};
}
}
close $file1;
close $file2;
Mysql:
Importing the data into Mysql or similar will provide an immense improvement. Will this be feasible ? You could see results in a few seconds.
mysql -e 'select search.* from search join contains using (keyword)' > outfile.txt
# but first you need to create the tables like this (only once off)
create table contains (
keyword varchar(255)
, primary key (keyword)
);
create table search (
keyword varchar(255)
,num bigint
,key (keyword)
);
# and load the data in:
load data infile 'file.contain.query.txt'
into table contains fields terminated by "add column separator here";
load data infile 'file.to.search.in.txt'
into table search fields terminated by "add column separator here";
If you want a pure Perl option, read your query file keys into a hash table, then check standard input against those keys:
#!/usr/bin/env perl
use strict;
use warnings;
# build hash table of keys
my $keyring;
open KEYS, "< file.contain.query.txt";
while (<KEYS>) {
chomp $_;
$keyring->{$_} = 1;
}
close KEYS;
# look up key from each line of standard input
while (<STDIN>) {
chomp $_;
my ($key, $value) = split("\t", $_); # assuming search file is tab-delimited; replace delimiter as needed
if (defined $keyring->{$key}) { print "$_\n"; }
}
You'd use it like so:
lookup.pl < file.to.search.txt
A hash table can take a fair amount of memory, but searches are much faster (hash table lookups are in constant time), which is handy since you have 10-fold more keys to lookup than to store.
If the files are already sorted:
join file1 file2
if not:
join <(sort file1) <(sort file2)
This may be a little dated, but is tailor-made for simple UNIX utilities. Given:
Then:
$ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7
ENST002 80
ENST001 90
Variants:
To strip the number printed after the key, remove tac command:
$ sort -m file.contain.query.txt file.to.search.in.txt | uniq -d -w7
To keep sorted order, add an extra tac command at the end:
$ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7 | tac
use strict;
use warings;
system("sort file.contain.query.txt > qsorted.txt");
system("sort file.to.search.in.txt > dsorted.txt");
open (QFILE, "<qsorted.txt") or die();
open (DFILE, "<dsorted.txt") or die();
while (my $qline = <QFILE>) {
my ($queryid) = ($qline =~ /ENST(\d+)/);
while (my $dline = <DFILE>) {
my ($dataid) = ($dline =~ /ENST(\d+)/);
if ($dataid == $queryid) { print $qline; }
elsif ($dataid > $queryid) { break; }
}
}