How to catch the longest sequence of a group

情到浓时终转凉″ submitted on 2020-07-05 12:34:20

Question


The task is to find the longest run of consecutive repeats of a group.

For instance, the DNA sequence "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC" contains 7 occurrences of AGATC, and the pattern (AGATC) matches each of them. Is it possible to write a regular expression that catches only the longest run, i.e. AGATCAGATCAGATCAGATCAGATC in the given text? If this is not possible with a regex alone, how can I iterate through each run (i.e. the 1st is AGATCAGATC, the 2nd is AGATCAGATCAGATCAGATCAGATC, et cetera) in Python?


Answer 1:


Use:

import re

sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
matches = re.findall(r'(?:AGATC)+', sequence)

# To find the longest subsequence
longest = max(matches, key=len)

Explanation:

Non-capturing group (?:AGATC)+

  • + Quantifier — Matches between one and unlimited times, as many times as possible.
  • AGATC matches the characters AGATC literally (case sensitive)

Result:

# print(matches)
['AGATCAGATC', 'AGATCAGATCAGATCAGATCAGATC']

# print(longest)
'AGATCAGATCAGATCAGATCAGATC'
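
The non-capturing group matters here because of how re.findall treats groups. As a small illustrative sketch (the variable names capturing and non_capturing are introduced only for this example), compare the same pattern written with a capturing group:

```python
import re

sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"

# Capturing group: findall returns the group's content, which for a
# repeated group is only its last repetition.
capturing = re.findall(r'(AGATC)+', sequence)

# Non-capturing group: findall returns each whole match.
non_capturing = re.findall(r'(?:AGATC)+', sequence)

print(capturing)      # ['AGATC', 'AGATC']
print(non_capturing)  # ['AGATCAGATC', 'AGATCAGATCAGATCAGATCAGATC']
```

With the capturing form, every run collapses to a single AGATC, so max(..., key=len) could no longer tell the runs apart.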





Answer 2:


The central question is, "Is it possible to write a regular expression that catches only the longest sequence?" The answer is "yes":

import re

s = 'AGATC_AGATCAGATC_AGATCAGATCAGATC_AGATC_AGATCAGATC'

m = re.search(r'((?:AGATC)+)(?!.*\1)', s)
print(m.group() if m else '')
  #=> AGATCAGATCAGATC

Regex demo<¯\(ツ)>Python demo

Python's regex engine performs the following operations.

(            begin capture group 1
  (?:AGATC)  match 'AGATC' in a non-capture group
  +          execute the non-capture group 1+ times
)            end capture group 1
(?!          begin a negative lookahead
  .*         match 0+ characters
  \1         match the content of capture group 1
)            end the negative lookahead

For the string s above, AGATC would first be matched, but the negative lookahead would find AGATC as the first part of AGATCAGATC, so the tentative match would be rejected. Then AGATCAGATC would be matched, but the negative lookahead would find AGATCAGATC as the first part of AGATCAGATCAGATC, so that tentative match would also be rejected. Next, AGATCAGATCAGATC would be matched and accepted, as the negative lookahead would not find that match later in the string. (re.findall, unlike re.search, would also match AGATCAGATC at the end of the string.)

If re.findall were used, there may be additional matches after the longest one (see the last test string at the link to the regex demo), but the lengths of the matches are non-increasing from the first to the last: an accepted match cannot reappear later in the string, so any later match must be shorter. Therefore, the first match, obtained using re.search, is a longest match.
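
The difference between re.search and re.findall for this pattern can be checked directly. The following is a minimal sketch on the same string s; the names pattern, all_matches, lengths and first are introduced here for illustration only:

```python
import re

s = 'AGATC_AGATCAGATC_AGATCAGATCAGATC_AGATC_AGATCAGATC'
pattern = r'((?:AGATC)+)(?!.*\1)'

# With a single capture group, findall returns the text of group 1,
# which here equals the whole match (the lookahead consumes nothing).
all_matches = re.findall(pattern, s)
lengths = [len(m) for m in all_matches]

print(all_matches)  # ['AGATCAGATCAGATC', 'AGATCAGATC']
print(lengths)      # [15, 10]

# The first match, which is what re.search returns, is the longest one.
first = re.search(pattern, s).group(1)
print(first == max(all_matches, key=len))  # True
```

Note that . does not match a newline by default, so for multi-line input the lookahead would only scan to the end of the current line unless re.DOTALL is passed.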




Answer 3:


Use re.finditer() to iterate over all matches. Then use max() with a key function to find the longest. Make it a function so you can use different groups.

import re

def find_longest(sequence, group):
    # build pattern
    pattern = fr"(?:{group})+"

    # iterate over all matches
    matches = (match[0] for match in re.finditer(pattern, sequence))

    # find the longest
    return max(matches, key=len)

seq = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"

find_longest(seq, "AGATC")  # returns 'AGATCAGATCAGATCAGATCAGATC'
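
One possible variant of the function above, not part of the original answer, guards against two edge cases: a group that contains regex metacharacters, and a sequence in which the group never occurs (where max() on an empty iterable would raise ValueError). The name longest_run and its return shape are assumptions made for this sketch, and it assumes unit is a non-empty string:

```python
import re

def longest_run(sequence, unit):
    """Return (longest run of `unit`, number of repeats); ('', 0) if absent."""
    # re.escape lets `unit` contain regex metacharacters safely.
    pattern = f"(?:{re.escape(unit)})+"
    runs = (m.group(0) for m in re.finditer(pattern, sequence))
    # default="" avoids the ValueError max() raises on an empty iterable.
    longest = max(runs, key=len, default="")
    return longest, len(longest) // len(unit)

seq = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"

print(longest_run(seq, "AGATC"))  # ('AGATCAGATCAGATCAGATCAGATC', 5)
print(longest_run(seq, "CCCC"))   # ('', 0)
```

Returning the repeat count alongside the run is a convenience for tasks that only need the number of consecutive repeats.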


Source: https://stackoverflow.com/questions/62078827/how-to-catch-the-longest-sequence-of-a-group
