Search for string allowing for one mismatch in any location of the string

后端未结

关注

 13  957

闹比i 2020-11-30 02:45

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi

13条回答

刺人心 (楼主)

2020-11-30 02:58

>>> import re
>>> seq="AGCCTCCCATGATTGAACAGATCAT"
>>> genome = "CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT..."
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))

>>> seq_re.findall(genome)  # list of matches
[]  

>>> seq_re.search(genome) # None if not found, otherwise a match object

This one stops a the first match, so may be a bit faster when there are multiple matches

>>> print "found" if any(seq_re.finditer(genome)) else "not found"
not found

>>> print "found" if seq_re.search(genome) else "not found" 
not found

>>> seq="CAT"
>>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))
>>> print "found" if seq_re.search(genome) else "not found"
found

for a genome of length 10,000,000 you are looking at about 2.5 days for a single thread to scan 230,000 sequences, so you may want to split up the task onto a few cores/cpus.

You can always start implementing a more efficient algorithm while this one is running :)

If you should wish to search for single dropped or added elements change the regexp to this

>>> seq_re=re.compile('|'.join(seq[:i]+'.{0,2}'+seq[i+1:] for i in range(len(seq))))

0 讨论(0)

查看其它13个回答