I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi
>>> import re
>>> seq="AGCCTCCCATGATTGAACAGATCAT"
>>> genome = "CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT..."
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))
>>> seq_re.findall(genome) # list of matches
[]
>>> seq_re.search(genome) # None if not found, otherwise a match object
This one stops a the first match, so may be a bit faster when there are multiple matches
>>> print "found" if any(seq_re.finditer(genome)) else "not found"
not found
>>> print "found" if seq_re.search(genome) else "not found"
not found
>>> seq="CAT"
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))
>>> print "found" if seq_re.search(genome) else "not found"
found
for a genome of length 10,000,000 you are looking at about 2.5 days for a single thread to scan 230,000 sequences, so you may want to split up the task onto a few cores/cpus.
You can always start implementing a more efficient algorithm while this one is running :)
If you should wish to search for single dropped or added elements change the regexp to this
>>> seq_re=re.compile('|'.join(seq[:i]+'.{0,2}'+seq[i+1:] for i in range(len(seq))))