I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi
You can use the TRE regex-matching library, which supports approximate (fuzzy) matching. It has bindings for Python, Perl and Haskell. An example using the Python bindings:
import tre
pt = tre.compile("Don(ald)?( Ervin)? Knuth", tre.EXTENDED)
data = """
In addition to fundamental contributions in several branches of
theoretical computer science, Donnald Erwin Kuth is the creator of
the TeX computer typesetting system, the related METAFONT font
definition language and rendering system, and the Computer Modern
family of typefaces.
"""
# Allow at most 3 errors (insertions, deletions or substitutions) per match
fz = tre.Fuzzyness(maxerr = 3)
print(fz)
m = pt.search(data, fz)
if m:
    print(m.groups())
    print(m[0])
which will output:
tre.Fuzzyness(delcost=1,inscost=1,maxcost=2147483647,subcost=1, maxdel=2147483647,maxerr=3,maxins=2147483647,maxsub=2147483647)
((95, 113), (99, 108), (102, 108))
Donnald Erwin Kuth
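To make the idea of an error budget like maxerr concrete for short DNA probes, here is a minimal pure-Python sketch of approximate substring search using the classic dynamic-programming approach (Sellers' algorithm). It is not how TRE is implemented and does not use the tre module; the function name approx_find and the sample sequences are made up for illustration.

```python
def approx_find(pattern, text, max_err):
    """Return (end_index, errors) for every position in text where some
    substring ending there is within max_err edits of pattern."""
    m = len(pattern)
    # prev[i] = fewest edits needed to align pattern[:i] with a substring
    # of text ending at the previous position. Before any text is read,
    # aligning pattern[:i] costs i (all characters missing).
    prev = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text):
        # cur[0] stays 0: a match may start at any position in text.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i - 1] + (pattern[i - 1] != ch),  # match or substitution
                prev[i] + 1,      # extra character in text
                cur[i - 1] + 1,   # pattern character missing from text
            )
        if cur[m] <= max_err:
            hits.append((j + 1, cur[m]))
        prev = cur
    return hits


# Hypothetical toy data: a 4-base probe against a short "genome".
exact = approx_find("ACGT", "TTACGTTT", 0)
fuzzy = approx_find("ACGT", "GGACCTGG", 1)
print(exact)
print(fuzzy)
```

This runs in time proportional to len(pattern) * len(text) per probe, so it is only meant to show what is being computed; for 230,000 probes against a whole genome you would want a compiled matcher such as TRE or an index-based tool.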
http://en.wikipedia.org/wiki/TRE_%28computing%29
http://laurikari.net/tre/