How do I find the shortest overlapping match using regular expressions?

無奈伤痛 2020-12-15 08:03

I'm still relatively new to regex. I'm trying to find the shortest string of text that matches a particular pattern, but am having trouble if the shortest pattern is a sub

9 answers
  • 2020-12-15 08:07

    You might be able to write the regex in such a way that it can't contain smaller matches.

    For your regex:

    a.*?b.*?c
    

    I think you can write this:

    a[^ab]*b[^ac]*c
    

    It's tricky to get that correct, and I don't see any more general or more obviously correct way to do it. (Edit—earlier I suggested a negative lookahead assertion, but I don't see a way to make that work.)
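
    Because a hand-tightened pattern is easy to get wrong, it can help to compare it against a brute-force search on small inputs. The following is a minimal sketch, not a definitive test: `brute_force_shortest` is a hypothetical helper written for this illustration, the sample text and the case-insensitive flag are borrowed from the other answers in this thread, and the candidate pattern is just one illustrative tightened variant.

```python
import re

def brute_force_shortest(pattern, text, flags=0):
    """Return the shortest substring that the pattern matches in full.

    Tries every (start, end) pair, so it makes O(n^2) regex calls:
    acceptable for checking small samples, not for real workloads.
    """
    rex = re.compile(pattern, flags)
    best = None
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if rex.fullmatch(candidate) and (best is None or len(candidate) < len(best)):
                best = candidate
    return best

text = "A|B|A|B|C|D|E|F|G"

# Reference answer: shortest substring matched by the loose pattern.
reference = brute_force_shortest(r"a.*?b.*?c", text, re.IGNORECASE)

# Illustrative tightened pattern (an assumption of this sketch).
tightened = re.search(r"a[^ab]*b[^ac]*c", text, re.IGNORECASE)

print(reference)
print(tightened.group(0) == reference)
```

    On this sample both approaches give A|B|C. Agreement on one input does not prove the tightened pattern is right in general, so it is worth trying inputs with repeated letters as well.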

  • 2020-12-15 08:08

    No, the Python regular expression engine has no way to do this directly.

    My take for a custom function, though:

    import re, itertools
    
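    # adapted from the itertools recipes (Python 3.10+ also ships itertools.pairwise)
    def pairwise(iterable):
        "s -> (s0,s1), (s1,s2), (s2, s3), ..."
        a, b = itertools.tee(iterable)
        next(b, None)  # advance the second iterator by one element
        return zip(a, b)  # Python 3: zip replaces itertools.izip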
    
    def find_matches(rex, text):
        "Find all matches, even overlapping ones"
        matches= list(rex.finditer(text))
    
        # first produce typical matches
        for match in matches:
            yield match.group(0)
    
        # next, run it for any patterns included in matches
        for match1, match2 in pairwise(matches):
            subtext= text[match1.start()+1:match2.end()+1]
            for result in find_matches(rex, subtext):
                yield result
    
        # also test the last match, if there was at least one
        if matches:
            subtext= text[matches[-1].start()+1:matches[-1].end()+1]
            # perhaps the previous "matches[-1].end()+1" can be omitted
            for result in find_matches(rex, subtext):
                yield result
    
    def shortest_match(rex, text):
        "Find the shortest match"
        return min(find_matches(rex, text), key=len)
    
    if __name__ == "__main__":
        pattern= re.compile('a.*?b.*?c', re.I)
        searched_text= "A|B|A|B|C|D|E|F|G"
        print (shortest_match(pattern, searched_text))
    
  • 2020-12-15 08:12

    This might be a useful application of sexegers. Regular-expression matching is biased toward the longest, leftmost choice. Using non-greedy quantifiers such as in .*? skirts the longest part, and reversing both the input and pattern can get around leftmost-matching semantics.

    Consider the following program that outputs A|B|C as desired:

    #! /usr/bin/env python
    
    import re
    
    string = "A|B|A|B|C|D|E|F|G"
    my_pattern = 'c.*?b.*?a'
    
    my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
    matches = my_regex.findall(string[::-1])
    
    for match in matches:
        print(match[::-1])
    

    Another way is to make a stricter pattern. Say you don't want to allow repetitions of characters already seen:

    my_pattern = 'a[^a]*?b[^ab]*?c'
    

    Your example is generic and contrived, but if we had a better idea of the inputs you're working with, we could offer better, more helpful suggestions.

  • 2020-12-15 08:19

    Another regex solution; it finds only the last occurrence of .*a.*b.*c:

    my_pattern = 'a(?!.*a.*b.*c).*?b[^c]*c'
    

    a(?!.*a.*?b.*?c) ensures that there is no 'a.*?b.*?c' after the first 'A', so strings like A|A|B|C, A|B|A|B|C, or A|B|C|A|B|C are eliminated from the results.

    b[^c]*c ensures that after 'B' there is only one 'C', so strings like A|B|C|B|C or A|B|C|C are eliminated from the results.

    So you have the smallest matching 'a.*?b.*?c'.
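
    To see the lookahead at work, the pattern can be run against the strings mentioned above. This is a small sketch under stated assumptions: it uses case-insensitive matching, as elsewhere in this thread, and a lazy `.*?` before `b` so that the match stops at the first `b` after the chosen `a`; the sample strings are only illustrations.

```python
import re

# Negative lookahead: reject an 'a' if another a...b...c sequence follows it.
# The lazy `.*?` before `b` is a choice made in this sketch.
pattern = re.compile(r"a(?!.*a.*b.*c).*?b[^c]*c", re.IGNORECASE)

samples = [
    "A|B|A|B|C|D|E|F|G",
    "A|A|B|C",
    "A|B|C|A|B|C",
    "A|B|C|C",
]

# Collect every non-overlapping match for each sample string.
results = {s: pattern.findall(s) for s in samples}
for s, found in results.items():
    print(s, "->", found)
```

    Each of these samples yields the single match A|B|C. Note that the lookahead keeps only the last candidate `a`, so the result is the last occurrence, which is the shortest one only when the input happens to be shaped like these examples.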

  • 2020-12-15 08:19

    I do not think that this task can be accomplished by a single regular expression. I have no proof that this is the case, but there are quite a lot of things that can't be done with regexes, and I expect this problem to be one of them. Some good examples of the limitations of regexes are given in this blog post.

  • 2020-12-15 08:20

    The regex engine starts searching from the beginning of the string till it finds a match and then exits. Thus if it finds a match before it even considers the smaller one, there is no way for you to force it to consider later matches in the same run - you will have to rerun the regex on substrings.

    Setting the global flag and choosing the shortest matched string won't help, as is evident from your example: the shorter match might be a substring of another match (or partly included in it). I believe you will have to start each subsequent search from (1 + start index of the previous match) and continue like that.
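
    The restart-from-the-next-position idea could look roughly like this in Python. It is a sketch rather than a definitive implementation: `shortest_overlapping_match` is a hypothetical name, and it relies on the `pos` argument of a compiled pattern's `search` method to resume scanning one character after the start of the previous match.

```python
import re

def shortest_overlapping_match(pattern, text, flags=0):
    """Return the shortest match found when overlapping matches are allowed.

    After each match, the search restarts one character past the start of
    that match, so matches nested inside earlier ones are also considered.
    Returns None if the pattern never matches.
    """
    rex = re.compile(pattern, flags)
    best = None
    pos = 0
    while True:
        match = rex.search(text, pos)
        if match is None:
            break
        found = match.group(0)
        if best is None or len(found) < len(best):
            best = found
        # Restart just after where this match began, not where it ended.
        pos = match.start() + 1
    return best

result = shortest_overlapping_match(r"a.*?b.*?c", "A|B|A|B|C|D|E|F|G", re.IGNORECASE)
print(result)
```

    With the lazy pattern from the question, each search returns the shortest match beginning at its start position, so the loop finds A|B|C here. For other patterns, lazy quantifiers do not always give the shortest match from a given start, so treat the result as a candidate rather than a guarantee.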
