I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi
This is quite old, but perhaps this simple solution could work. Loop through the genome sequence taking 25-character slices. Convert each slice to a NumPy array and compare it element-wise to the 25-character query string (also as a NumPy array). Sum the resulting boolean array, and if the sum is 24 (that is, exactly one mismatch), print the position in the loop and the mismatch.
The next few lines show the comparison working:
>>> import numpy as np
>>> a = ['A','B','C']
>>> b = np.array(a)
>>> b
array(['A', 'B', 'C'], dtype='<U1')
>>> c = ['A','D','C']
>>> d = np.array(c)
>>> b==d
array([ True, False,  True])
>>> sum(b==d)
2
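
Putting the loop and the comparison together, here is a minimal sketch of the idea. The function name `find_single_mismatches` and the toy `genome` and `probe` strings are made up for illustration; a real run would read the genome from a file and use the 25-character sequences. It is a plain Python loop, so it will be slow on a full genome with 230,000 queries.

```python
import numpy as np

def find_single_mismatches(genome, probe):
    """Return (start, mismatch_index) for every window of the genome
    that matches the probe everywhere except at exactly one position."""
    g = np.array(list(genome))   # one character per array element
    p = np.array(list(probe))
    k = len(p)
    hits = []
    for start in range(len(g) - k + 1):
        matches = g[start:start + k] == p   # boolean array, True where equal
        if matches.sum() == k - 1:          # exactly one mismatch
            mismatch_index = int(np.flatnonzero(~matches)[0])
            hits.append((start, mismatch_index))
    return hits

# Toy data (hypothetical): a 4-character probe instead of 25
genome = "ACGTACGTTACGAACGT"
probe = "ACGA"
print(find_single_mismatches(genome, probe))
# [(0, 3), (4, 3), (13, 3)]
```

The exact match at position 9 is not reported, because the test is for `k - 1` matches. For 25-character sequences that is the "sum is 24" check described above.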