Question
Is it possible to find overlapping matches using regular expressions when searching again for the same pattern? I want to find matches that occur three times. For example, babab occurs three times in babababab:
[babab]abab
ba[babab]ab
baba[babab]
This is my current Python implementation:
import re
matches = re.findall(r'(?=(\w+).*\1).*\1', "babababab")
print(matches)
My program finds only baba instead of babab. Thanks!
Answer 1:
We can generalize the solution to any regex.
Suppose we have a valid regex pattern for which we want to find overlapping matches.
To get overlapping matches, we need to avoid consuming characters in each match and rely on the bump-along mechanism to evaluate the regex at every position of the string. This can be achieved by surrounding the whole regex in a look-ahead, (?=<pattern>), and nesting a capturing group inside it to capture the match: (?=(<pattern>)).
This technique works with Python's re engine because, after it finds an empty match, it simply bumps along to the next position. It does not re-evaluate the regex at the same position looking for a non-empty match on a second try, as the PCRE engine does.
Sample code:
import re
inp = '10.5.20.52.48.10'
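# findall returns tuples when the pattern has more than one group; keep only group 1
matches = [m[0] if isinstance(m, tuple) else m for m in re.findall(r'(?=(\d+(\.\d+){2}))', inp)]
print(matches)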
Output:
['10.5.20', '0.5.20', '5.20.52', '20.52.48', '0.52.48', '52.48.10', '2.48.10']
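Applied to the string from the question, the same idea can be sketched as follows. This is a minimal illustration rather than part of the original answer: the names text and hits are assumptions, and re.finditer is used instead of re.findall so that the start index of each overlapping match is available.

```python
import re

text = "babababab"

# Wrap the literal in a lookahead so that no characters are consumed;
# group 1 holds the text that the lookahead matched.
hits = [(m.start(), m.group(1)) for m in re.finditer(r'(?=(babab))', text)]

print(hits)  # [(0, 'babab'), (2, 'babab'), (4, 'babab')]
```

Each match is empty as far as the engine is concerned, so the search resumes one character further on and finds the occurrences at indices 0, 2 and 4.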
If the original pattern doesn't have numbered backreferences then we can build the overlapping version of the regex with string concatenation.
However, if it does, the regex will need to be modified manually to correct the backreferences which have been shifted by the additional capturing group.
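The two cases above can be sketched with a small helper. The function name overlapping_findall is hypothetical, and the backreference pattern (\w)\1 is an illustrative assumption, not something from the original question.

```python
import re

def overlapping_findall(pattern, text, flags=0):
    """Return overlapping matches of `pattern` (hypothetical helper)."""
    # String concatenation: wrap the pattern in a lookahead plus one
    # extra capturing group, which becomes group 1.
    wrapped = '(?=(' + pattern + '))'
    return [m.group(1) for m in re.finditer(wrapped, text, flags)]

# No backreferences: the pattern can be wrapped unchanged.
print(overlapping_findall(r'\d+(\.\d+){2}', '10.5.20.52.48.10'))

# With a backreference: the original (\w)\1 has to be rewritten by hand
# as (\w)\2, because the added outer group is now group 1.
print(overlapping_findall(r'(\w)\2', 'aaab'))  # ['aa', 'aa']
```

Using m.group(1) from re.finditer also avoids the tuple handling that re.findall needs when the pattern contains groups of its own.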
Note that this method does not give you overlapping matches that start at the same index (e.g. searching for a+ in aaa gives 3 matches instead of 6). Overlapping matches starting at the same index cannot be implemented in most regex flavors or libraries; Perl is an exception.
Answer 2:
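The a+ example can be checked with a short sketch. The second half is not a regex feature: it is a plain brute-force scan over every substring, shown only to make the count of 6 concrete, and the names per_start and all_spans are assumptions.

```python
import re

text = 'aaa'

# Lookahead wrapping: one (greedy) match per start index.
per_start = [m.group(1) for m in re.finditer(r'(?=(a+))', text)]
print(per_start)  # ['aaa', 'aa', 'a']

# Brute force: test every substring with fullmatch (quadratic number of checks).
pattern = re.compile(r'a+')
all_spans = [
    text[i:j]
    for i in range(len(text))
    for j in range(i + 1, len(text) + 1)
    if pattern.fullmatch(text[i:j])
]
print(all_spans)  # ['a', 'aa', 'aaa', 'a', 'aa', 'a']
```

Because the brute-force version matches against slices, patterns that rely on anchors or lookarounds may see different context than they would in the full string.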
One trick you may use here is to match on ba(?=bab), which consumes only ba. The regex engine then moves forward by just those two characters, so the next match can start inside the previous babab:
import re
matches = re.findall(r'ba(?=bab)', "babababab")
matches = [i + 'bab' for i in matches]
print(matches)
This prints:
['babab', 'babab', 'babab']
Note that I concatenate the tail bab to each match, which is fine, because we know the actual logical match was babab.
Source: https://stackoverflow.com/questions/60293891/lookahead-regex-failing-to-find-the-same-overlapping-matches
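As a variation on the same trick, the full match can be sliced out of the original string by position instead of being rebuilt by concatenation. This is only a sketch; the names text, target and starts are assumptions.

```python
import re

text = "babababab"
target = "babab"

# Consume only 'ba'; the lookahead checks for the rest without consuming it.
starts = [m.start() for m in re.finditer(r'ba(?=bab)', text)]

# Slice each full match out of the original string by its start index.
matches = [text[s:s + len(target)] for s in starts]

print(starts)   # [0, 2, 4]
print(matches)  # ['babab', 'babab', 'babab']
```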