Using regex to search until desired pattern

Posted by 不羁的心 on 2019-12-13 04:22:30

Question


I am using the following regex:

orfre = '^(?:...)*?((ATG)(...){%d,}?(?=(TAG|TAA|TGA)))' % (aa)

I basically want to find all sequences that start with ATG followed by triplets (e.g. TTA, TTC, GTC, etc.) until it finds a stop codon in frame. However, as my regex is written, it won't actually stop at a stop codon if aa is large. Instead, it will keep searching until it finds one such that the condition of aa is met. I would rather have it search the entire string until a stop codon is found. If a match isn't long enough (for a given aa argument) then it should return None.

String data: AAAATGATGCATTAACCCTAATAA

Desired output from regex: ATGATGCATTAA

Unless aa > 5, in which case nothing should be returned.

Actual output I'm getting: ATGATGCATTAACCCTAA


Answer 1:


This should do the trick. You can see it on codepad.

import re

num = 4  # exact number of triplets between ATG and the final stop codon
blue = 'XXXAAAATGATGCATTAACCCTAATAAXXX'
pattern = "^(?:...)*(ATG(...){%d}(?:TAG|TAA|TGA))" % num

m = re.match(pattern, blue)
print(m.group(1))

Which outputs: ATGCATTAACCCTAATAA

Breaking it down:

^                  - Anchor at the start of the string.
(?:...)*           - Find, but don't capture, any number of triplets.
(                  - Begin our capture block
  ATG              - A literal string of 'ATG', no need to wrap.
  (...){%d}        - Exactly num triplets (any triplets, stop codons included)
  (?:TAG|TAA|TGA)  - A non-capturing block of either 'TAG', 'TAA' or 'TGA'
)                  - End the capture block.

Unless I'm missing some other requirements, it shouldn't need to be much more complex than this.
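
If the goal is to stop at the first in-frame stop codon, as the question asks, one option is a negative lookahead that keeps stop codons out of the repeated triplet group. Below is a minimal sketch, assuming Python 3. The names `find_orf` and `min_codons` are hypothetical, and `min_codons` is assumed to count the triplets between the start codon and the stop codon, which may not be exactly what `aa` means in the question.

```python
import re

STOP = "TAG|TAA|TGA"

def find_orf(seq, min_codons):
    """Return the first in-frame ORF (start through stop codon), or None.

    min_codons is an assumed threshold: the minimum number of triplets
    that must sit between the ATG and the first in-frame stop codon.
    """
    pattern = (
        r"^(?:...)*?"            # lazily skip whole triplets to stay in frame
        r"(ATG"                  # start codon
        r"(?:(?!%s)...){%d,}"    # at least min_codons triplets, none a stop codon
        r"(?:%s))"               # the first in-frame stop codon
    ) % (STOP, min_codons, STOP)
    m = re.match(pattern, seq)
    return m.group(1) if m else None

print(find_orf("AAAATGATGCATTAACCCTAATAA", 2))  # ATGATGCATTAA
print(find_orf("AAAATGATGCATTAACCCTAATAA", 3))  # None
```

Because the repeated group can never consume a stop codon, the match cannot run past the first one. If the first ORF is too short, the lazy prefix moves on to later in-frame ATG codons before giving up.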




Answer 2:


Supplementary note: if you want to check the six frames available in one sequence, don't forget to check also the complementary chain:

comp_chain = chain[::-1]    

(see extended slices)

Then swap A's with T's and G's with C's in the reversed string.
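
Both steps can be combined in one small function. This is a sketch, assuming Python 3 and an uppercase sequence containing only A, C, G and T; `reverse_complement` and `COMPLEMENT` are hypothetical names.

```python
# Translation table mapping each base to its complement.
# Assumes the sequence holds only the characters A, C, G and T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(chain):
    """Reverse the chain with an extended slice, then complement each base."""
    return chain[::-1].translate(COMPLEMENT)

print(reverse_complement("AAAATG"))  # CATTTT
```

Running the ORF search on both `chain` and `reverse_complement(chain)`, at each of the three offsets, covers all six frames.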



Source: https://stackoverflow.com/questions/18731894/using-regex-to-search-until-desired-pattern
