Question
I am using the following regex:
orfre = '^(?:...)*?((ATG)(...){%d,}?(?=(TAG|TAA|TGA)))' % (aa)
I basically want to find all sequences that start with ATG followed by triplets (e.g. TTA, TTC, GTC, etc.) until it finds a stop codon in frame. However, as my regex is written, it won't actually stop at a stop codon if aa is large. Instead, it will keep searching until it finds one such that the condition of aa is met. I would rather have it search the entire string until a stop codon is found. If a match isn't long enough (for a given aa argument) then it should return None.
String data: AAAATGATGCATTAACCCTAATAA
Desired output from regex: ATGATGCATTAA
Unless aa > 5, in which case nothing should be returned.
Actual output I'm getting: ATGATGCATTAACCCTAA
Answer 1:
This should do the trick.
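import re
num = 4  # exact number of triplets required between ATG and the stop codon
blue = 'XXXAAAATGATGCATTAACCCTAATAAXXX'
# Skip whole triplets, then capture ATG + num triplets + a stop codon.
pattern = "^(?:...)*(ATG(...){%d}(?:TAG|TAA|TGA))" % num
m = re.match(pattern, blue)
print(m.group(1))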
Which outputs: ATGCATTAACCCTAATAA
Breaking it down:
^ - Anchor the match at the start of the string.
(?:...)* - Find, but don't capture any number of triplets.
( - Begin our capture block
ATG - A literal string of 'ATG', no need to wrap.
(...){%d} - Exactly num triplets (4 in this example)
(?:TAG|TAA|TGA) - A non capturing block of either 'TAG', 'TAA' or 'TGA'
) - End the capture block.
Unless I'm missing some other requirements, it shouldn't need to be much more complex than this.
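One requirement from the question that the pattern above does not enforce is stopping at the first in-frame stop codon: (...) also matches TAG, TAA and TGA, which is why the output above runs through two TAA codons. A minimal sketch of one way to handle this, assuming a negative lookahead is acceptable. The names find_orf, STOP and min_codons are made up for this illustration, and min_codons is assumed to count the triplets between the starting ATG and the stop codon. It only scans the reading frame that starts at the first character of the string.

```python
import re

# In-frame stop codons, written as a regex alternation.
STOP = "TAG|TAA|TGA"

def find_orf(seq, min_codons):
    """Return the first ORF in frame 0 (ATG ... first stop codon), or None.

    min_codons is the minimum number of triplets required between the
    starting ATG and the stop codon (an assumption of this sketch).
    """
    # (?:(?!STOP)...) matches one triplet only if it is not a stop codon,
    # so the repetition cannot run past the first in-frame stop.
    pattern = r"^(?:...)*?(ATG(?:(?!%s)...){%d,}(?:%s))" % (STOP, min_codons, STOP)
    m = re.match(pattern, seq)
    return m.group(1) if m else None

print(find_orf("AAAATGATGCATTAACCCTAATAA", 2))  # ATGATGCATTAA
print(find_orf("AAAATGATGCATTAACCCTAATAA", 3))  # None
```

If no ATG in that frame is followed by at least min_codons non-stop triplets and then a stop codon, re.match returns None and so does the function.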
Answer 2:
Supplementary note: if you want to check all six reading frames of a sequence, don't forget to also check the complementary strand:
comp_chain = chain[::-1]
(the [::-1] extended slice reverses the string)
Then swap A's with T's and G's with C's to get the complementary strand.
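The two steps above (reverse, then complement) can be sketched as follows. This is only an illustration: reverse_complement, six_frames and COMPLEMENT are hypothetical names, and the sequence is assumed to contain only the uppercase letters A, T, G and C.

```python
# Translation table that swaps A<->T and G<->C.
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(chain):
    """Reverse the sequence with an extended slice, then complement each base."""
    return chain[::-1].translate(COMPLEMENT)

def six_frames(chain):
    """Return the six reading frames: offsets 0, 1, 2 on each strand."""
    rc = reverse_complement(chain)
    return [strand[offset:] for strand in (chain, rc) for offset in range(3)]

print(reverse_complement("AAAATG"))  # CATTTT
```

Each of the six strings can then be passed to a pattern anchored with ^(?:...)*, since that anchor only ever looks at the frame starting at the first character.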
Source: https://stackoverflow.com/questions/18731894/using-regex-to-search-until-desired-pattern