How do I find the shortest overlapping match using regular expressions?

無奈伤痛 2020-12-15 08:03

I'm still relatively new to regex. I'm trying to find the shortest string of text that matches a particular pattern, but am having trouble if the shortest pattern is a sub

9 Answers
  • 2020-12-15 08:22

    A Python loop that looks for the shortest match by brute force: try the regex at each starting position, from left to right, and keep the shortest result:

    shortest = None
    for i in range(len(string)):
        # try to match starting exactly at offset i
        m = my_regex.match(string[i:])
        if m:
            mstr = m.group()
            if shortest is None or len(mstr) < len(shortest):
                shortest = mstr

    print(shortest)
    

    Another loop, this time letting re.findall do the hard work of collecting the non-overlapping matches, then brute-force testing each match from right to left, looking for a shorter substring that still matches:

    # find all matches using findall
    matches = my_regex.findall(string)
    
    # for each match, try to match right-hand substrings
    shortest = None
    for m in matches:
        for i in range(-1,-len(m),-1):
            mstr = m[i:]        
            if my_regex.match(mstr):
                break
        else:
            mstr = m
    
        if shortest is None or len(mstr) < len(shortest):
            shortest = mstr
    
    print(shortest)
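The first loop above can be run end to end with concrete input. This is only a sketch: the sample string, the pattern, and the helper name `shortest_match` are assumptions made for illustration, since `my_regex` and `string` are not defined in the answer itself.

```python
import re

def shortest_match(regex, text):
    """Anchor the regex at every offset and keep the shortest match (hypothetical helper)."""
    shortest = None
    for start in range(len(text)):
        # match() only succeeds if the pattern matches beginning at `start`
        m = regex.match(text, start)
        if m and (shortest is None or len(m.group()) < len(shortest)):
            shortest = m.group()
    return shortest

# Assumed sample data, not taken from the question
sample = "A|B|A|B|C|D"
regex = re.compile(r"a.*?b.*?c", re.IGNORECASE)

first_hit = regex.search(sample).group()  # what a plain search returns
best = shortest_match(regex, sample)      # what the brute-force loop returns
print(first_hit)
print(best)
```

With this input, the plain search stops at the match that begins at the first `A`, while the loop also tries the second `A` and finds the shorter candidate.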
    
  • 2020-12-15 08:28

    Contrary to most other answers here, this can be done in a single regex using a positive lookahead assertion with a capturing group:

    >>> my_pattern = '(?=(a.*?b.*?c))'
    >>> my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
    >>> matches = my_regex.findall(string)
    >>> print(min(matches, key=len))
    A|B|C
    

    findall() returns the match found at every starting position, including overlapping ones, so you need min() to pick the shortest.

    How this works:

    • We're not matching any text in this regex, just positions in the string (which the regex engine steps through during a match attempt).
    • At each position, the regex engine looks ahead to see whether your regex would match at this position.
    • If so, it will be captured by the capturing group.
    • If not, it won't.
    • In either case, the regex engine then steps ahead one character and repeats the process until the end of the string.
    • Since the lookahead assertion doesn't consume any characters, all overlapping matches will be found.
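The steps in the list above can be watched on a small input. This is a minimal sketch; the sample string is an assumption chosen so that a shorter match sits inside a longer one.

```python
import re

# Assumed sample data, not taken from the question
sample = "A|B|A|B|C|D"

# The lookahead consumes nothing; the capturing group records what would match here
overlapping = re.compile(r"(?=(a.*?b.*?c))", re.DOTALL | re.IGNORECASE)

# One entry per starting position at which the inner pattern matches
positions = [(m.start(1), m.group(1)) for m in overlapping.finditer(sample)]
candidates = overlapping.findall(sample)
shortest = min(candidates, key=len)

print(positions)
print(shortest)
```

If the pattern might not match at all, `min(candidates, key=len, default=None)` avoids the `ValueError` that `min()` raises on an empty list.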
  • 2020-12-15 08:29

    No. Perl returns the leftmost match; your non-greedy quantifiers only decide how far it extends from that starting point. You'll have to loop, I'm afraid.

    Edit: Yes, I realize I said Perl above, but I believe it is true for Python.
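The behaviour is easy to check in Python. This is a small sketch with an assumed sample string: even with non-greedy quantifiers, the search commits to the first position where a match is possible.

```python
import re

# Assumed sample data, not taken from the question
sample = "A|B|A|B|C|D"
lazy = re.compile(r"a.*?b.*?c", re.IGNORECASE)

match = lazy.search(sample)
span = match.span()      # where the reported match begins and ends
text = match.group()     # the matched text itself

# findall() resumes after the end of each match, so the inner candidate is skipped
non_overlapping = lazy.findall(sample)

print(span, text)
print(non_overlapping)
```

The match starts at offset 0, and the shorter candidate that begins at the second `A` is never reported, because the engine has already consumed that part of the string.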
