Fast alternative to grep -f

2020-12-11 16:00

Given a file of query keys (file.contain.query.txt) and a file to search (file.to.search.in.txt), what is a fast alternative to grep -f for pulling out the lines whose key appears in the query file?

file.contain.query.txt:

    ENST001
    ENST002
    ENST003

file.to.search.in.txt:

    ENST001  90
    ENST002  80
    ENST004  50
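
For reference, the plain grep -f baseline this question is asking to beat looks like this (adding -F, which treats the patterns as fixed strings rather than regular expressions, is already a common speedup):

    grep -F -f file.contain.query.txt file.to.search.in.txt

With the sample files it prints the two matching lines, ENST001  90 and ENST002  80.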
8 answers
  • 2020-12-11 16:35

    This Perl code may help you:

    use strict;
    use warnings;
    open my $file1, "<", "file.contain.query.txt" or die $!;
    open my $file2, "<", "file.to.search.in.txt" or die $!;
    
    my %KEYS = ();
    # hash %KEYS marks the keys listed in file.contain.query.txt
    
    while(my $line=<$file1>) {
        chomp $line;
        $KEYS{$line} = 1;
    }
    
    while(my $line=<$file2>) {
        if( $line =~ /(\w+)\s+(\d+)/ ) {
            print "$1 $2\n" if $KEYS{$1};
        }
    }
    
    close $file1;
    close $file2;
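
    Saved as, say, filter.pl (a name chosen here just for illustration), the script takes no arguments, since the file names are hard-coded:

    perl filter.pl

    With the sample files it prints ENST001 90 and ENST002 80 (single-spaced, because of the "$1 $2" format).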
    
  • 2020-12-11 16:37

    MySQL:

    Importing the data into MySQL (or a similar database) will provide an immense improvement, if that is feasible for you. You could see results in a few seconds.

    mysql -e 'select search.* from search join contains using (keyword)' > outfile.txt 
    
    # but first you need to create the tables like this (a one-off step)
    
    create table contains (
       keyword   varchar(255)
       , primary key (keyword)
    );
    
    create table search (
       keyword varchar(255)
       ,num bigint
       ,key (keyword)
    );
    
    # and load the data in:
    
    load data infile 'file.contain.query.txt' 
        into table contains fields terminated by "add column separator here";
    load data infile 'file.to.search.in.txt' 
        into table search fields terminated by "add column separator here";
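
    As a concrete sketch of one load step, assuming tab-separated columns and a database named lookup (both assumptions; substitute your actual delimiter and database name):

    mysql lookup -e "load data local infile 'file.to.search.in.txt' into table search fields terminated by '\t'"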
    
  • 2020-12-11 16:40

    If you want a pure Perl option, read your query file keys into a hash table, then check standard input against those keys:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    
    # build hash table of keys
    my $keyring;
    open KEYS, "<", "file.contain.query.txt" or die $!;
    while (<KEYS>) {
        chomp $_;
        $keyring->{$_} = 1;
    }
    close KEYS;
    
    # look up key from each line of standard input
    while (<STDIN>) {
        chomp $_;
        my ($key, $value) = split("\t", $_); # assuming search file is tab-delimited; replace delimiter as needed
        if (defined $keyring->{$key}) { print "$_\n"; }
    }
    

    You'd use it like so:

    lookup.pl < file.to.search.in.txt
    

    A hash table can take a fair amount of memory, but searches are much faster (hash table lookups run in constant time), which is handy here, since you have ten times as many keys to look up as to store.
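
    The same build-a-hash-then-probe idea can be written as an awk one-liner (a sketch assuming, as above, that the key is the first whitespace-separated field):

    awk 'NR==FNR { keys[$1] = 1; next } $1 in keys' file.contain.query.txt file.to.search.in.txt

    The NR==FNR block runs only while the first file is read and fills the keys array; lines of the second file are printed whenever their first field was stored.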

  • 2020-12-11 16:40

    If the files are already sorted:

    join file1 file2
    

    if not:

    join <(sort file1) <(sort file2)
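
    join pairs lines on the first field by default, so with the sample files from the question (which are already sorted) either command prints the rows whose key appears in both files, with fields separated by a single space:

    ENST001 90
    ENST002 80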
    
  • 2020-12-11 16:43

    This may be a little dated, but the task is tailor-made for simple UNIX utilities. Given:

    • keys are fixed-length (here 7 chars)
    • files are sorted (true in the example), allowing a fast merge (sort -m) instead of a full sort

    Then:

    $ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7
    
    ENST002  80
    ENST001  90
    

    Variants:

    To strip the number printed after the key, remove the tac command:

    $ sort -m file.contain.query.txt file.to.search.in.txt | uniq -d -w7
    

    To keep sorted order, add an extra tac command at the end:

    $ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7 | tac
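
    For reference, the merged stream that uniq -d -w7 inspects looks like this with the sample files; a bare key sorts immediately before its matching data line, which is why the stream is reversed with tac when the line carrying the number should be the one emitted:

    ENST001
    ENST001  90
    ENST002
    ENST002  80
    ENST003
    ENST004  50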
    
  • 2020-12-11 16:55
    use strict;
    use warnings;
    
    system("sort file.contain.query.txt > qsorted.txt") == 0 or die "sort failed";
    system("sort file.to.search.in.txt  > dsorted.txt") == 0 or die "sort failed";
    
    open (QFILE, "<qsorted.txt") or die();
    open (DFILE, "<dsorted.txt") or die();
    
    # walk both sorted files in step (a merge join); keep the current data
    # line across queries so no line is lost when an id overshoots
    my $dline = <DFILE>;
    while (my $qline = <QFILE>) {
      my ($queryid) = ($qline =~ /ENST(\d+)/);
      while (defined $dline) {
        my ($dataid) = ($dline =~ /ENST(\d+)/);
        if    ($dataid == $queryid) { print $dline; $dline = <DFILE>; }  # matching data line, key and number
        elsif ($dataid <  $queryid) { $dline = <DFILE>; }
        else                        { last; }  # stop once the data id passes the query id
      }
    }
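
    With the sample files this prints:

    ENST001  90
    ENST002  80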
    