Search for string allowing for one mismatch in any location of the string

闹比i 2020-11-30 02:45

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi

13 Answers
  •  死守一世寂寞
    2020-11-30 03:20

    I googled for "toxoplasma gondii parasite genome" to find some of these genome files online. I found what I think was close: a file titled "TgondiiGenomic_ToxoDB-6.0.fasta" at http://toxodb.org, about 158 MB in size. I used the following pyparsing expression to extract the gene sequences; it took just under 2 minutes:

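    from pyparsing import (Word, Group, SkipTo, LineEnd, Combine, OneOrMore,
                           nums, printables)
    
    fname = "TgondiiGenomic_ToxoDB-6.0.fasta"
    fastasrc = open(fname).read()   # yes! just read the whole dang 158 MB!
    
    """
    Sample header:
    >gb|scf_1104442823584 | organism=Toxoplasma_gondii_VEG | version=2008-07-23 | length=1448
    """
    # convert the matched digits to an int at parse time
    integer = Word(nums).setParseAction(lambda t:int(t[0]))
    # one record: a header line (id and length captured), then the sequence lines;
    # adjacent=False lets Combine join sequence chunks separated by line breaks
    genebit = Group(">gb|" + Word(printables)("id") + SkipTo("length=") + 
                    "length=" + integer("genelen") + LineEnd() + 
                    Combine(OneOrMore(Word("ACGTN")),adjacent=False)("gene"))
    
    # read gene data from .fasta file - takes just under a couple of minutes
    genedata = OneOrMore(genebit).parseString(fastasrc)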
    

    (Surprise! some of the gene sequences include runs of 'N's! What the heck is that about?!)
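
    For readers who only need the id and length from a header line like the sample above, a plain regular expression is one alternative to the pyparsing expression. This is a hedged sketch, not what was run here; the names HEADER_RE and parse_header are made up for illustration, and the pattern assumes every header follows the sample's layout.

```python
import re

# Assumed layout: ">gb|<id> | ... | length=<digits>", as in the sample header.
HEADER_RE = re.compile(r"^>gb\|(?P<id>\S+).*?length=(?P<length>\d+)")


def parse_header(line):
    """Return (id, length) from a FASTA header line, or None if it does not fit."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    return m.group("id"), int(m.group("length"))


sample = (">gb|scf_1104442823584 | organism=Toxoplasma_gondii_VEG"
          " | version=2008-07-23 | length=1448")
parsed = parse_header(sample)
```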

    Then I wrote this class as a subclass of the pyparsing Token class, for doing close matches:

    from pyparsing import Token, ParseException
    
    class CloseMatch(Token):
        """Match a fixed sequence, tolerating up to maxMismatches substitutions."""
        def __init__(self, seq, maxMismatches=1):
            super(CloseMatch,self).__init__()
            self.name = seq
            self.sequence = seq
            self.maxMismatches = maxMismatches
            self.errmsg = "Expected " + self.sequence
            self.mayIndexError = False
            self.mayReturnEmpty = False
    
        def parseImpl( self, instring, loc, doActions=True ):
            start = loc
            maxloc = start + len(self.sequence)
    
            # fail right away if the remaining input is shorter than the sequence
            if maxloc > len(instring):
                raise ParseException(instring, loc, self.errmsg, self)
    
            mismatches = []
            for seqloc, c in enumerate(self.sequence):
                if instring[start + seqloc] != c:
                    mismatches.append(seqloc)
                    # give up as soon as the mismatch budget is exceeded
                    if len(mismatches) > self.maxMismatches:
                        raise ParseException(instring, start + seqloc,
                                             self.errmsg, self)
    
            # the single token is a (matched string, mismatch positions) tuple
            return maxloc, (instring[start:maxloc],mismatches)
    

    For every match, this returns a tuple containing the actual string that was matched and a list of the mismatch locations. Exact matches of course return an empty list for the second value. (I like this class; I think I'll add it to the next release of pyparsing.)
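
    Stripped of the pyparsing machinery, the comparison is a position-by-position check with an early exit once the mismatch budget is spent. The following is a minimal standalone sketch, not the class itself; the function name mismatch_positions and its convention of returning None on failure are assumptions made for illustration.

```python
def mismatch_positions(window, pattern, max_mismatches=1):
    """Return the indexes where window differs from pattern, or None if the
    lengths differ or more than max_mismatches positions disagree."""
    if len(window) != len(pattern):
        return None
    mismatches = []
    for i, (a, b) in enumerate(zip(window, pattern)):
        if a != b:
            mismatches.append(i)
            # stop early: this window can no longer qualify
            if len(mismatches) > max_mismatches:
                return None
    return mismatches


pattern = "ATCATCGAATGGAATCTAATGGAAT"
exact = mismatch_positions(pattern, pattern)
one_off = mismatch_positions("ATCATCGAATGGAATCGAATGGAAT", pattern)
two_off = mismatch_positions("ATCATCGAACGGAATCGAATGGAAT", pattern, 2)
```

    An exact match yields an empty list, which mirrors the second element of the tuple described above.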

    I then ran this code to search for "up-to-2-mismatch" matches in all of the sequences read from the .fasta file (recall that genedata is a sequence of ParseResults groups, each containing an id, an integer length, and a sequence string):

    searchseq = CloseMatch("ATCATCGAATGGAATCTAATGGAAT", 2)
    for g in genedata:
        # print(...) with a single pre-formatted string runs on Python 2 and 3
        print("%s (%d)" % (g.id, g.genelen))
        print("-"*24)
        for t,startLoc,endLoc in searchseq.scanString(g.gene):
            matched, mismatches = t[0]
            print("MATCH: %s" % searchseq.sequence)
            print("FOUND: %s" % matched)
            if mismatches:
                # put a '*' under each mismatched position of the FOUND line
                print("       " + ''.join(' ' if i not in mismatches else '*' 
                                for i,c in enumerate(searchseq.sequence)))
            else:
                print("")
            print("at location %d" % startLoc)
            print("")
        print("")
    

    I took the search sequence at random from one of the gene bits, to be sure I could find an exact match, and just out of curiosity to see how many 1- and 2-element mismatches there were.

    This took a little while to run. After 45 minutes, I had this output, listing each id and gene length, and any partial matches found:

    scf_1104442825154 (964)
    ------------------------
    
    scf_1104442822828 (942)
    ------------------------
    
    scf_1104442824510 (987)
    ------------------------
    
    scf_1104442823180 (1065)
    ------------------------
    ...
    

    I was getting discouraged at not seeing any matches, until:

    scf_1104442823952 (1188)
    ------------------------
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAACGGAATCGAATGGAAT
                    *      *        
    at location 33
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATGGAAT
                           *        
    at location 175
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATGGAAT
                           *        
    at location 474
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATGGAAT
                           *        
    at location 617
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATAGAAT
                           *   *    
    at location 718
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGATTCGAATGGAAT
                        *  *        
    at location 896
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATGGTAT
                           *     *  
    at location 945
    

    And finally my exact match at:

    scf_1104442823584 (1448)
    ------------------------
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGACTCGAATGGAAT
                        *  *        
    at location 177
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCAAATGGAAT
                           *        
    at location 203
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCAAATGGAATCGAATGGAAT
                 *         *        
    at location 350
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCGAATGGAAA
                           *       *
    at location 523
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCAAATGGAATCGAATGGAAT
                 *         *        
    at location 822
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCATCGAATGGAATCTAATGGAAT
    
    at location 848
    
    MATCH: ATCATCGAATGGAATCTAATGGAAT
    FOUND: ATCGTCGAATGGAGTCTAATGGAAT
              *         *           
    at location 969
    

    So while this didn't set any speed records, I got the job done, and found some 2-mismatch matches too, in case they might be of interest.
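
    Conceptually, the search slides a window the length of the search sequence across each gene and counts the differences at every offset. As a rough sketch in plain Python that does not go through pyparsing at all (the names find_close_matches and marker_line and the short test strings are made up here for illustration):

```python
def find_close_matches(text, pattern, max_mismatches=2):
    """Yield (start, matched, mismatches) for every window of text that
    differs from pattern in at most max_mismatches positions."""
    n = len(pattern)
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        mismatches = [i for i in range(n) if window[i] != pattern[i]]
        if len(mismatches) <= max_mismatches:
            yield start, window, mismatches


def marker_line(length, mismatches):
    """Build a line with '*' under each mismatched position."""
    marked = set(mismatches)
    return "".join("*" if i in marked else " " for i in range(length))


# Toy input: one exact copy and one copy with a single substitution.
text = "TTTT" + "ACGTAC" + "TTTT" + "ACCTAC" + "TTTT"
hits = list(find_close_matches(text, "ACGTAC", max_mismatches=1))
for start, matched, mismatches in hits:
    print("FOUND: %s" % matched)
    print("       " + marker_line(len(matched), mismatches))
    print("at location %d" % start)
```

    Unlike the early-exit check in CloseMatch, this version compares every position of every window, so it trades speed for brevity.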

    For comparison, here is a regular-expression (RE) version that finds only exact and 1-mismatch matches:

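    import re
    seqStr = "ATCATCGAATGGAATCTAATGGAAT"
    # alternation of the exact sequence plus, for each position, a variant whose
    # character class accepts any of ACTGN except the expected base
    searchSeqREStr = seqStr + '|' + \
        '|'.join(seqStr[:i]+"[ACTGN]".replace(c,'') +seqStr[i+1:] 
                 for i,c in enumerate(seqStr))
    
    searchSeqRE = re.compile(searchSeqREStr)
    
    for g in genedata:
        # print(...) with a single pre-formatted string runs on Python 2 and 3
        print("%s (%d)" % (g.id, g.genelen))
        print("-"*24)
        for match in searchSeqRE.finditer(g.gene):
            print("MATCH: %s" % seqStr)
            print("FOUND: %s" % match.group(0))
            print("at location %d" % match.start())
            print("")
        print("")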
    

    (At first, I tried searching the raw FASTA file source itself, but was puzzled why there were so few matches compared to the pyparsing version. Then I realized that some of the matches must cross the line breaks, since the sequences in the FASTA file are wrapped at n characters.)
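
    To illustrate the line-break problem: a match that straddles two wrapped lines is invisible in the raw text but appears once each record's lines are joined. Below is a small sketch of un-wrapping FASTA records in plain Python, as an alternative to the pyparsing extraction used in this answer; the function name read_fasta and the two toy records are invented for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


raw = ">rec1 | length=8\nACGT\nACGT\n>rec2\nNNAC\n"
records = list(read_fasta(raw.splitlines()))

# "GTAC" straddles the line break in rec1: absent from the raw text,
# present in the joined sequence.
in_raw = "GTAC" in raw
in_joined = "GTAC" in records[0][1]
```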

    So after the first pyparsing pass to extract the gene sequences to match against, this RE-based searcher then took about another 1-1/2 minutes to scan all of the un-textwrapped sequences, to find all of the same 1-mismatch entries that the pyparsing solution did.
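
    The alternation used by the RE-based searcher can be wrapped in a small helper so that it can be rebuilt for any search sequence. This is a sketch under the same assumptions as the code above (an alphabet of A, C, G, T and N, substitutions only); the helper name one_mismatch_regex and the toy input are hypothetical.

```python
import re


def one_mismatch_regex(seq, alphabet="ACGTN"):
    """Compile a regex matching seq exactly or with one substituted base."""
    alternatives = [seq]
    for i, c in enumerate(seq):
        # character class of every base except the expected one
        others = "".join(ch for ch in alphabet if ch != c)
        alternatives.append(seq[:i] + "[" + others + "]" + seq[i + 1:])
    return re.compile("|".join(alternatives))


pattern = one_mismatch_regex("ACGT")
hits = [(m.start(), m.group(0))
        for m in pattern.finditer("TTACGTTTACCTTTAGCTTT")]
```

    Note that finditer reports non-overlapping matches only, so two candidate windows that overlap would be reported as one hit.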
