Grep with multi-threading

问题

I have the following (big) file with 30233088 strings:

head mystringfile.txt:

GAATGAACACGAAGAA
GAATGAACACGAAGAC
GAATGAACACGAAGAG
GAATGAACACGAAGCA

cat sequence.txt

AAATAGAGGGCGGTCCAGGCGTGTCGAAACACTGGGTCCAGGGCAAGAGCGGTTCGGGTGTCAGGAAAGCCCCCAAGGGGGTTCGCGCGGTTTGCAGTGAGGTAGAGGCCGGTGTATGGGTAGACAATTGGGGTCCCAAAGAAAAAGGCTCGTCCAACATCATAATAAACCCAAGCACGATAAAAAGCAAACGCAGACTTCAATAGGGTACGAGCAATTGTGGCAGGGTGCTCGCTGTCAGGGTTAGATCTTCTTGGAGTCGCGTCGCTCGGGGGGGCAAGGCCAACGTAAGATCGTGGCTGATCGCTGGCAATGCGGTCGGTTGGGTGGTCGCTAGTAGGGGCACGGCGGTCTCTTATGGCGTCGTAAAATGCGTCTCCAAAGCGAAAAGGGGCGGCAGACAAGTCACCGGGCAAGCTTAGAGGTCTGGGGCCCGTGGCTTTAGGGGAATGAACACGAAGACGCGAAACGAAGTCGTGTTTCTTGTTGGCTGTAGAGGGGAAAACCGTCTGGGGCGATCTGGCGTAGTAGTGCGTGTCTTGCAGTGAGCTCCCCGTCCGTAAGGATTCGCAGGAATCCTGCGTGAAGCTCGGTCGTCTCGGCCGTGTCTCGGGGTTTGATTGCGGGTTCAGATTGGAAAGGTCTCCTCGGGTCGTTTGCTGCATTTGCTCGCAACCCTGACGTGAAAGGGGTGAGCTGTCTCCAATCTGCCACGCTGGGTGTTGCGTCGTCAGTAAAAGACTTGGTCAAGCTGGGACCTCGCAAGATCGCGAGAGGGTTAAGCACAAAAGGTATGGCGAAGCTCCCGGGTGCTCTTGTGGCCACCCAGAATCATGGTGACGTAGGTTTTGCGAAGCCATCAAAAATTCAGGCGGCAAAACGAGCCAGTAGGGTCCTGGGCAGCTGGGCTTGTAGTGGGTAGGCGGCAAAACGCAAAGAATGAACACGAAGCAACTCCGTAGTGTGACGGGGGTTCTGACAAACGTCCTGCAAGAAGTTCGTCTTGGG

which I need to grep in another sequence file to determine the position of the match, which I do as following:

while read line; do grep -b -o $line sequence.txt >>sequence.txt.count; done<mystringfile.txt

Running the code like this of course takes a very long time and just runs part of 1 thread, so how can I modify it (with parallel or xargs?) so that it is running on as many threads as I want to specify?

回答1:

Your idea is wrong with using shell loop to process text. You are opening one new file descriptor for re-directing to output file for each of the 30233088 iterations on your input file. It is prone to a huge performance impact or a case of running out of open file descriptors.

Use the right tool for the job. Awk is your friend here. If the sequence.txt is just a huge pattern as you say, you can just slurp it into a variable for regex match as below. The solutions involves no memory overhead of having to store entries in RAM

awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}' mystringfile.txt

This should be relatively faster than the approach you have and to speed up things further, change your locale settings to match the C local,

LC_ALL=C awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}' mystringfile.txt

To match with the grep's option of -b to print the byte offset start, use n-1 in the answer above instead of just n.

If you still want to use GNU parallel, use --pipepart to split the file physically into parts and specify the --block size to how much MB of file content to read

parallel -a mystringfile.txt --pipepart --block=20M -q awk -v sequence="$(<sequence.txt)" 'n=index(sequence, $1){print n":"$1}'

回答2:

If all your search strings in mystringfile.txt are of the same length (like in your sample file, 16 bytes), you could store all 16 byte strings from sequence.txt to an associative array (if you have the memory) and speed up the searching. Let's try that. First we need some test material, let's create a 2400000 byte sequence.txt, takes about a second:

$ awk -v seed=$RANDOM 'BEGIN{a[0]="A";a[1]="C";a[2]="G";a[3]="T";srand(seed);for(i=1;i<=2400000;i++)printf "%s", a[int(4*rand())];print ""}' > mystringfile.txt

and mystringfile.txt with 30233088 16 byte search strings (4 mins 50 secs):

$ awk -v seed=$RANDOM 'BEGIN{a[0]="A";a[1]="C";a[2]="G";a[3]="T";srand(seed);for(i=1;i<=30233088;i++){for(j=1;j<=16;j++)printf "%s", a[int(4*rand())];print ""}}' > mystringfile.txt

Then the script:

$ awk '
NR==FNR {                              # process the sequence file
    l=length($1)-15                    # length-15 16 byte strings coming up
    for(i=1;i<=l;i++) {                # using l as variable name is stupid
        s=substr($0,i,16)              
        a[s]=a[s] (a[s]==""?"":",") i  # hash string start indexes to a, string as key
    }                                  # a["ACTGTGCACGTATAGC"]=3,141,592
    next
}                                      # the search part 
$1 in a {                              # if search string is found
    print a[$1], $1                    # output index and string
}' sequence.txt mystringfile.txt

Storing the 2400000 byte sequence.txt to the hash took 13 seconds and used 721 MB of memory on my mini laptop. The whole script ran 35 seconds and found around 17000 hits.

来源：https://stackoverflow.com/questions/53222347/grep-with-multi-threading

标签

awk

parallel-processing

grep

xargs