Search for string allowing for one mismatch in any location of the string

闹比i 2020-11-30 02:45

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi

13 answers
  •  萌比男神i
    2020-11-30 03:08

    You can use the TRE regex library, which supports approximate (fuzzy) matching. It also has bindings for Python, Perl and Haskell.

    import tre

    # Compile the pattern using extended regex syntax.
    pt = tre.compile("Don(ald)?( Ervin)? Knuth", tre.EXTENDED)
    data = """
    In addition to fundamental contributions in several branches of
    theoretical computer science, Donnald Erwin Kuth is the creator of
    the TeX computer typesetting system, the related METAFONT font
    definition language and rendering system, and the Computer Modern
    family of typefaces.
    """

    # Allow at most 3 errors in total (insertions, deletions, substitutions).
    fz = tre.Fuzzyness(maxerr=3)
    print(fz)
    m = pt.search(data, fz)

    if m:
        print(m.groups())  # (start, end) offsets of the match and of each group
        print(m[0])        # the matched text itself
    

    which will output

    tre.Fuzzyness(delcost=1,inscost=1,maxcost=2147483647,subcost=1, maxdel=2147483647,maxerr=3,maxins=2147483647,maxsub=2147483647)
    ((95, 113), (99, 108), (102, 108))
    Donnald Erwin Kuth
    

    http://en.wikipedia.org/wiki/TRE_%28computing%29

    http://laurikari.net/tre/
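
For the question's narrower case (a fixed-length sequence with at most one substituted base, no insertions or deletions), a library-free sketch is also possible. The code below is only an illustration, not part of TRE: the function names `hamming_at_most` and `find_with_one_mismatch` are made up for this example, and it assumes the genome fits in memory as a single string and that each pattern is at least 2 characters long. It relies on the pigeonhole idea that if only one position may differ, either the left half or the right half of the pattern must occur exactly, so exact hits of either half serve as seeds that are then verified.

```python
def hamming_at_most(a, b, limit):
    """Return True if equal-length strings a and b differ in at most `limit` positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return False  # stop early once the budget is exceeded
    return True


def find_with_one_mismatch(pattern, text):
    """Return sorted start offsets where `pattern` occurs in `text`
    with at most one substitution (hypothetical helper for illustration)."""
    n = len(pattern)
    half = n // 2
    hits = set()
    # Seeds: (exact substring to look for, its offset inside the pattern).
    seeds = ((pattern[:half], 0), (pattern[half:], half))
    for seed, offset in seeds:
        pos = text.find(seed)
        while pos != -1:
            start = pos - offset  # where the full pattern would begin
            if 0 <= start <= len(text) - n:
                # Verify the whole window, allowing one mismatch.
                if hamming_at_most(pattern, text[start:start + n], 1):
                    hits.add(start)
            pos = text.find(seed, pos + 1)  # continue after this seed hit
    return sorted(hits)
```

Calling `find_with_one_mismatch(seq, genome)` once per sequence returns every start offset for that sequence. With 230,000 sequences this is still one pass over the genome per half-pattern, so for real workloads an index over the genome or a dedicated short-read aligner is likely to be much faster; the sketch is meant only to show the seed-and-verify idea.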
