Using regex to search until desired pattern

Question

I am using the following regex:

orfre = '^(?:...)*?((ATG)(...){%d,}?(?=(TAG|TAA|TGA)))' % (aa)

I basically want to find all sequences that start with ATG followed by triplets (e.g. TTA, TTC, GTC, etc.) until it finds a stop codon in frame. However, as my regex is written, it won't actually stop at a stop codon if aa is large. Instead, it will keep searching until it finds one such that the condition of aa is met. I would rather have it search the entire string until a stop codon is found. If a match isn't long enough (for a given aa argument) then it should return None.

String data: AAAATGATGCATTAACCCTAATAA

Desired output from regex: ATGATGCATTAA

Unless aa > 5, in which case nothing should be returned.

Actual output I'm getting: ATGATGCATTAACCCTAA
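The behavior described above can be reproduced with a short script. This is only a sketch: it assumes aa = 5, since the question does not say which value produced the output shown, and the names `seq`, `m` and `result` are illustrative.

```python
import re

aa = 5  # assumed value; the question does not state which aa was used
seq = 'AAAATGATGCATTAACCCTAATAA'
orfre = '^(?:...)*?((ATG)(...){%d,}?(?=(TAG|TAA|TGA)))' % (aa)

m = re.match(orfre, seq)
result = m.group(1) if m else None
print(result)  # ATGATGCATTAACCCTAA
```

Because `{5,}?` demands at least five triplets after ATG, the in-frame TAA that follows CAT is consumed as an ordinary triplet instead of ending the match.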

Answer 1:

This should do the trick. You can see it on codepad.

import re

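num = 4  # exact number of triplets required between ATG and the stop codon
blue = 'XXXAAAATGATGCATTAACCCTAATAAXXX'
pattern = "^(?:...)*(ATG(...){%d}(?:TAG|TAA|TGA))" % num

m = re.match(pattern, blue)
print(m.group(1))  # parenthesized so it runs on Python 2 and Python 3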

Which outputs: ATGCATTAACCCTAATAA

Breaking it down:

^                  - Anchor at the start of the string.
(?:...)*           - Match, but don't capture, any number of triplets.
(                  - Begin the capture group.
  ATG              - The literal string 'ATG'; no need to wrap it.
  (...){%d}        - Exactly num triplets (%d is filled in from num).
  (?:TAG|TAA|TGA)  - A non-capturing group matching 'TAG', 'TAA' or 'TGA'.
)                  - End the capture group.

Unless I'm missing some other requirements, it shouldn't need to be much more complex than this.
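One caveat: nothing in the pattern above stops the middle triplets from being stop codons themselves, and the output shown contains an in-frame TAA before the final stop. If the match has to end at the first in-frame stop codon, as the question asks, one possible approach is to put a negative lookahead in front of each triplet. The following is a sketch, not a tested solution for real data. The helper `find_orf` and its parameter `min_codons` are hypothetical names, and it assumes that `aa` means the minimum number of triplets between ATG and the stop codon.

```python
import re

STOP = "TAG|TAA|TGA"

def find_orf(seq, min_codons):
    """Return the first frame-0 ORF (ATG through its first stop) or None.

    min_codons is an assumed reading of aa: the minimum number of
    triplets between ATG and the stop codon.
    """
    # (?!stop)... matches one triplet only if it is not a stop codon,
    # so the repeated group cannot run past the first in-frame stop.
    pattern = r"^(?:...)*?(ATG(?:(?!%s)...){%d,}(?:%s))" % (STOP, min_codons, STOP)
    m = re.match(pattern, seq)
    return m.group(1) if m else None

seq = "AAAATGATGCATTAACCCTAATAA"
print(find_orf(seq, 2))  # ATGATGCATTAA
print(find_orf(seq, 5))  # None
```

With the question's string this returns ATGATGCATTAA when the minimum is 2 or less, and None once the minimum exceeds the two triplets available before the first stop.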

Answer 2:

Supplementary note: if you want to check all six reading frames of a sequence, don't forget to also check the complementary strand. Start by reversing the sequence:

comp_chain = chain[::-1]

(This uses Python's extended slice syntax.)

Then swap A's with T's and G's with C's to get the complement.
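The two steps can be combined into one helper. This is a minimal sketch that assumes the sequence contains only uppercase A, C, G and T (any other character raises a KeyError). The names `COMPLEMENT`, `reverse_complement` and `six_frames` are illustrative, not from the answer.

```python
# Base-pairing table: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(chain):
    """Reverse the strand, then swap each base for its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(chain))

def six_frames(chain):
    """Return three forward frames followed by three reverse-complement frames."""
    rc = reverse_complement(chain)
    return [strand[offset:] for strand in (chain, rc) for offset in range(3)]

print(reverse_complement("ATGC"))  # GCAT
```

Each of the six strings returned by `six_frames` starts at frame 0, so a pattern anchored with `^(?:...)*` can be applied to each one in turn.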

Source: https://stackoverflow.com/questions/18731894/using-regex-to-search-until-desired-pattern

Tags

python

regex

bioinformatics