Ambiguous substring with mismatches

Submitted by 南楼画角 on 2019-12-11 07:01:27

Question


I'm trying to use regular expressions to find a substring in a DNA string. The substring contains ambiguous bases, for example ATCGR, where R can be A or G. The script must also allow up to x mismatches. This is my code:

import regex

s = 'ACTGCTGAGTCGT'    
regex.findall(r"T[AG]T"+'{e<=1}', s, overlapped=True)

So, with one mismatch, I would expect substrings such as AC**TGC**TGAGTCGT, ACTGC**TGA**GTCGT and ACTGCTGAGT**CGT**. The expected result is:

['TGC', 'TGA', 'AGT', 'CGT']

But the output is

['TGC', 'TGA']

Even with re.findall, the code doesn't find the last substring. On the other hand, if the code is set to allow 2 mismatches with {e<=2}, the output is

['TGC', 'TGA']

Is there another way to get all the substrings?


Answer 1:


If I understand correctly, you are looking for all three-letter substrings that match the pattern T[GA]T with at most one error. I think the only kind of error you want is a character substitution, since you never mention two-letter results.

To obtain the expected result, change {e<=1} to {s<=1} (or {s<2}) and apply it to the whole pattern rather than only to the last letter, by enclosing the pattern in a group (capturing or non-capturing, as you prefer). Otherwise the constraint {s<=1} applies only to the last letter:

regex.findall(r'(T[AG]T){s<=1}', s, overlapped=True)
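
For comparison, here is a minimal sketch of the same substitution-only search written without the regex module. It slides a window over the sequence and counts positions where the base is not allowed by the pattern symbol. The function name find_with_mismatches and the IUPAC table are illustrative assumptions (not part of the original answer), and the table covers only a few ambiguity codes:

```python
# Hypothetical helper; only the standard library is used.
# Partial table of IUPAC nucleotide codes: symbol -> bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}

def find_with_mismatches(pattern, text, max_mismatches=1):
    """Return every window of len(pattern) in text that differs from
    pattern by at most max_mismatches substitutions (overlaps included)."""
    hits = []
    width = len(pattern)
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        # Count positions whose base is not permitted by the pattern symbol.
        mismatches = sum(
            base not in IUPAC[code] for code, base in zip(pattern, window)
        )
        if mismatches <= max_mismatches:
            hits.append(window)
    return hits

s = "ACTGCTGAGTCGT"
# "TRT" is the IUPAC spelling of T[AG]T.
print(find_with_mismatches("TRT", s, max_mismatches=1))
# ['TGC', 'TGA', 'AGT', 'CGT']
```

Because every window has the same length as the pattern, this approach only models substitutions, which matches the {s<=1} constraint; insertions and deletions would need a different method.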


Source: https://stackoverflow.com/questions/46355841/ambiguous-substring-with-mismatches
