Fast alternative to grep -f

2020-12-11 16:00

Given a file of query keys (file.contain.query.txt) and a file to search (file.to.search.in.txt), what is a fast alternative to grep -f for pulling out the lines whose key appears in the query file?

file.contain.query.txt:

    ENST001
    ENST002
    ENST003

file.to.search.in.txt:

    ENST001  90
    ENST002  80
    ENST004  50
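
For reference, the plain grep -f baseline this question is asking to beat looks like this (adding -F, which treats the patterns as fixed strings rather than regular expressions, is already a common speedup):

    grep -F -f file.contain.query.txt file.to.search.in.txt

With the sample files it prints the two matching lines, ENST001  90 and ENST002  80.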
8 answers
  • 2020-12-11 16:35

    This Perl code may help you:

    use strict;
    use warnings;
    open my $file1, "<", "file.contain.query.txt" or die $!;
    open my $file2, "<", "file.to.search.in.txt" or die $!;
    
    my %KEYS = ();
    # hash %KEYS marks the keys listed in file.contain.query.txt
    
    while(my $line=<$file1>) {
        chomp $line;
        $KEYS{$line} = 1;
    }
    
    while(my $line=<$file2>) {
        if( $line =~ /(\w+)\s+(\d+)/ ) {
            print "$1 $2\n" if $KEYS{$1};
        }
    }
    
    close $file1;
    close $file2;
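
    Saved as, say, filter.pl (a name chosen here just for illustration), the script takes no arguments, since the file names are hard-coded:

    perl filter.pl

    With the sample files it prints ENST001 90 and ENST002 80 (single-spaced, because of the "$1 $2" format).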
    
  • 2020-12-11 16:37

    MySQL:

    Importing the data into MySQL (or a similar database) will provide an immense improvement, if that is feasible for you. You could see results in a few seconds.

    mysql -e 'select search.* from search join contains using (keyword)' > outfile.txt 
    
    # but first you need to create the tables like this (a one-off step)
    
    create table contains (
       keyword   varchar(255)
       , primary key (keyword)
    );
    
    create table search (
       keyword varchar(255)
       ,num bigint
       ,key (keyword)
    );
    
    # and load the data in:
    
    load data infile 'file.contain.query.txt' 
        into table contains fields terminated by "add column separator here";
    load data infile 'file.to.search.in.txt' 
        into table search fields terminated by "add column separator here";
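
    As a concrete sketch of one load step, assuming tab-separated columns and a database named lookup (both assumptions; substitute your actual delimiter and database name):

    mysql lookup -e "load data local infile 'file.to.search.in.txt' into table search fields terminated by '\t'"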
    
  • 2020-12-11 16:40

    If you want a pure Perl option, read your query file keys into a hash table, then check standard input against those keys:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    
    # build hash table of keys
    my $keyring;
    open KEYS, "<", "file.contain.query.txt" or die $!;
    while (<KEYS>) {
        chomp $_;
        $keyring->{$_} = 1;
    }
    close KEYS;
    
    # look up key from each line of standard input
    while (<STDIN>) {
        chomp $_;
        my ($key, $value) = split("\t", $_); # assuming search file is tab-delimited; replace delimiter as needed
        if (defined $keyring->{$key}) { print "$_\n"; }
    }
    

    You'd use it like so:

    lookup.pl < file.to.search.in.txt
    

    A hash table can take a fair amount of memory, but searches are much faster (hash table lookups run in constant time), which is handy here, since you have ten times as many keys to look up as to store.
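
    The same build-a-hash-then-probe idea can be written as an awk one-liner (a sketch assuming, as above, that the key is the first whitespace-separated field):

    awk 'NR==FNR { keys[$1] = 1; next } $1 in keys' file.contain.query.txt file.to.search.in.txt

    The NR==FNR block runs only while the first file is read and fills the keys array; lines of the second file are printed whenever their first field was stored.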

  • 2020-12-11 16:40

    If the files are already sorted:

    join file1 file2
    

    if not:

    join <(sort file1) <(sort file2)
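
    join pairs lines on the first field by default, so with the sample files from the question (which are already sorted) either command prints the rows whose key appears in both files, with fields separated by a single space:

    ENST001 90
    ENST002 80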
    
  • 2020-12-11 16:43

    This may be a little dated, but the task is tailor-made for simple UNIX utilities. Given:

    • keys are fixed-length (here 7 chars)
    • files are sorted (true in the example), allowing a fast merge (sort -m) instead of a full sort

    Then:

    $ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7
    
    ENST002  80
    ENST001  90
    

    Variants:

    To strip the number printed after the key, remove the tac command:

    $ sort -m file.contain.query.txt file.to.search.in.txt | uniq -d -w7
    

    To keep sorted order, add an extra tac command at the end:

    $ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7 | tac
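
    For reference, the merged stream that uniq -d -w7 inspects looks like this with the sample files; a bare key sorts immediately before its matching data line, which is why the stream is reversed with tac when the line carrying the number should be the one emitted:

    ENST001
    ENST001  90
    ENST002
    ENST002  80
    ENST003
    ENST004  50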
    
  • 2020-12-11 16:55
    use strict;
    use warnings;
    
    system("sort file.contain.query.txt > qsorted.txt") == 0 or die "sort failed";
    system("sort file.to.search.in.txt  > dsorted.txt") == 0 or die "sort failed";
    
    open (QFILE, "<qsorted.txt") or die();
    open (DFILE, "<dsorted.txt") or die();
    
    # walk both sorted files in step (a merge join); keep the current data
    # line across queries so no line is lost when an id overshoots
    my $dline = <DFILE>;
    while (my $qline = <QFILE>) {
      my ($queryid) = ($qline =~ /ENST(\d+)/);
      while (defined $dline) {
        my ($dataid) = ($dline =~ /ENST(\d+)/);
        if    ($dataid == $queryid) { print $dline; $dline = <DFILE>; }  # matching data line, key and number
        elsif ($dataid <  $queryid) { $dline = <DFILE>; }
        else                        { last; }  # stop once the data id passes the query id
      }
    }
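
    With the sample files this prints:

    ENST001  90
    ENST002  80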
    