Search for string allowing for one mismatch in any location of the string

闹比i 2020-11-30 02:45

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi

13 answers
  •  刺人心 (OP)
    2020-11-30 02:58

    >>> import re
    >>> seq="AGCCTCCCATGATTGAACAGATCAT"
    >>> genome = "CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT..."
    >>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))
    
    >>> seq_re.findall(genome)  # list of matches
    []  
    
    >>> seq_re.search(genome) # None if not found, otherwise a match object
    

    The following version stops at the first match, so it may be a bit faster when there are multiple matches:

    >>> print("found" if any(seq_re.finditer(genome)) else "not found")
    not found
    
    >>> print("found" if seq_re.search(genome) else "not found")
    not found
    
    >>> seq="CAT"
    >>> seq_re=re.compile('|'.join(seq[:i]+'.'+seq[i+1:] for i in range(len(seq))))
    >>> print("found" if seq_re.search(genome) else "not found")
    found
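
    As a rough sketch of how the pattern above can report where the hits are, the snippet below wraps the same alternation (one `.` wildcard per position) in a lookahead so that overlapping hits are listed too. The helper names `one_mismatch_re` and `find_hits`, the lookahead wrapper, and the short test genome are assumptions made for illustration, not part of the original answer.

```python
import re

def one_mismatch_re(seq):
    """Build the alternation from the answer: seq with one '.' per position."""
    alternatives = (seq[:i] + "." + seq[i + 1:] for i in range(len(seq)))
    # Assumption: a lookahead with a capture group, so finditer can
    # report hits that overlap each other.
    return re.compile("(?=(" + "|".join(alternatives) + "))")

def find_hits(seq, genome):
    """Return (start, matched_text) for every window within one mismatch."""
    return [(m.start(), m.group(1)) for m in one_mismatch_re(seq).finditer(genome)]

# Short made-up genome fragment, for demonstration only.
genome = "CATGGGAGGCTTGCGGAGCCTGAGG"
for start, text in find_hits("CAT", genome):
    print(start, text)
```

    On this fragment, `CAT` is reported at offset 0 (exact), and the one-mismatch windows `CTT` and `CCT` at offsets 9 and 18.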
    

    For a genome of length 10,000,000, you are looking at about 2.5 days for a single thread to scan all 230,000 sequences, so you may want to split the task across a few cores/CPUs.

    You can always start implementing a more efficient algorithm while this one is running :)
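
    As one example of what a more efficient approach might look like (a different technique from the regex above, not something the answer itself describes): if a sequence matches with at most one mismatch, then at least one of its two halves must match exactly. The sketch below finds exact hits of each half with `str.find` and then checks the full window by counting mismatches. The names `hamming_at_most` and `find_one_mismatch` are hypothetical, and no timing claim is made.

```python
def hamming_at_most(a, b, limit):
    """True if equal-length strings a and b differ in at most `limit` places."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return False
    return True

def find_one_mismatch(seq, genome):
    """Return sorted start positions where seq occurs with <= 1 mismatch."""
    n = len(seq)
    half = n // 2
    # With at most one mismatch, one of these two seeds is intact.
    seeds = ((0, seq[:half]), (half, seq[half:]))
    hits = set()
    for offset, seed in seeds:
        pos = genome.find(seed)
        while pos != -1:
            start = pos - offset  # where the full sequence would begin
            if 0 <= start <= len(genome) - n and start not in hits:
                if hamming_at_most(seq, genome[start:start + n], 1):
                    hits.add(start)
            pos = genome.find(seed, pos + 1)
    return sorted(hits)

# Short made-up genome fragment, for demonstration only.
print(find_one_mismatch("CAT", "CATGGGAGGCTTGCGGAGCCTGAGG"))
```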

    If you also want to catch a single dropped or added element, change the regexp to this:

    >>> seq_re=re.compile('|'.join(seq[:i]+'.{0,2}'+seq[i+1:] for i in range(len(seq))))
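
    To make the effect of `.{0,2}` concrete, here is a small check on a toy 4-letter sequence (the sequence, the helper name `one_edit_re`, and the use of `fullmatch` are assumptions for the demonstration). Zero wildcard characters corresponds to a dropped element, one to a substitution, and two to an added element. Note that two wildcard characters can also absorb a substitution plus an insertion, so the pattern is slightly more permissive than exactly one edit.

```python
import re

def one_edit_re(seq):
    """Alternation with '.{0,2}' standing in for each position in turn."""
    return re.compile("|".join(
        re.escape(seq[:i]) + ".{0,2}" + re.escape(seq[i + 1:])
        for i in range(len(seq))))

pat = one_edit_re("ACGT")

candidates = [
    "ACGT",   # exact copy
    "ACT",    # G dropped
    "ACGGT",  # extra G added
    "ATTGT",  # C replaced by T and a T added: two edits, still accepted
    "AAAA",   # three mismatches: rejected
]
for text in candidates:
    print(text, bool(pat.fullmatch(text)))
```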
    
