Lookahead regex failing to find the same overlapping matches

佐手、 posted on 2021-01-28 08:20:52

Question


Is it possible to find overlapping matches using regular expressions when searching again for the same pattern? I want to be able to find matches that occur three times. For example, babab occurs three times in babababab (each occurrence is marked with brackets):

[babab]abab

ba[babab]ab

baba[babab]

This is my current Python implementation:

import re
matches = re.findall(r'(?=(\w+).*\1).*\1', "babababab")
print(matches)

My program finds only baba instead of babab. Thanks!


Answer 1:


We can generalize the solution to any regex.

Let's say we have a valid regex pattern for which we want to find overlapping matches.

In order to get overlapping matches, we need to avoid consuming characters in each match, relying on the bump-along mechanism to evaluate the regex on every position of the string. This can be achieved by surrounding the whole regex in a look-ahead (?=<pattern>), and we can nest a capturing group to capture the match (?=(<pattern>)).

This technique works with Python's re engine because, after finding an empty match, it simply bumps along to the next position. It does not re-evaluate the regex at the same position looking for a non-empty match on a second try, as the PCRE engine does.

Sample code:

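import re

inp = '10.5.20.52.48.10'
# When the pattern has more than one group, findall returns tuples; item 0 is the outer capture.
matches = [m[0] if isinstance(m, tuple) else m for m in re.findall(r'(?=(\d+(\.\d+){2}))', inp)]
print(matches)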

Output:

['10.5.20', '0.5.20', '5.20.52', '20.52.48', '0.52.48', '52.48.10', '2.48.10']
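
The wrapping step can also be packaged as a small helper. This is only a sketch: find_overlapping is a hypothetical name, not part of re, and it assumes the pattern has no numbered backreferences and no leading inline flags. Using finditer with group(1) avoids the tuple check needed with findall.

```python
import re

def find_overlapping(pattern, text):
    """Return every match of `pattern` in `text`, allowing overlaps.

    Hypothetical helper (not part of `re`). Assumes `pattern` has no
    numbered backreferences, since the added outer group shifts them by one.
    """
    # Wrap the pattern in a look-ahead so no characters are consumed,
    # and in a capturing group so the matched text can still be read.
    wrapped = re.compile('(?=(' + pattern + '))')
    # group(1) is always the outer group, however many groups `pattern` has.
    return [m.group(1) for m in wrapped.finditer(text)]

print(find_overlapping(r'\d+(\.\d+){2}', '10.5.20.52.48.10'))
```

For the sample input this gives the same list as the output shown above.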

If the original pattern doesn't have numbered backreferences then we can build the overlapping version of the regex with string concatenation.

However, if it does, the regex will need to be modified manually to correct the backreferences which have been shifted by the additional capturing group.
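
As an illustration of that manual correction, here is a minimal sketch with a made-up pattern, (\w)\1, which matches a doubled character. Once the outer capturing group is added, the inner group becomes group 2, so the backreference has to be rewritten as \2.

```python
import re

text = 'aaabb'

# Original pattern: a word character followed by itself.
# Matches consume characters, so they cannot overlap.
plain = [m.group(0) for m in re.finditer(r'(\w)\1', text)]

# Overlapping version: the outer group is now group 1 and the inner
# group is group 2, so the backreference changes from \1 to \2.
overlapping = [m.group(1) for m in re.finditer(r'(?=((\w)\2))', text)]

print(plain)        # ['aa', 'bb']
print(overlapping)  # ['aa', 'aa', 'bb']
```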

Note that this method doesn't give you overlapping matches that start at the same index (e.g. looking for a+ in aaa gives 3 matches instead of 6). Overlapping matches starting at the same index cannot be implemented in most regex flavors/libraries, except for Perl.
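
The a+ example can be checked directly. The second half of this sketch is not a regex-only technique: it is a brute-force workaround, shown purely for illustration, that tests every substring with Pattern.fullmatch and is quadratic in the length of the text.

```python
import re

text = 'aaa'

# Look-ahead approach: at most one (greedy) match per starting index.
per_index = [m.group(1) for m in re.finditer(r'(?=(a+))', text)]

# Brute force: test every (start, end) slice against the whole pattern.
pat = re.compile(r'a+')
every = [
    text[i:j]
    for i in range(len(text))
    for j in range(i + 1, len(text) + 1)
    if pat.fullmatch(text, i, j)
]

print(per_index)  # ['aaa', 'aa', 'a']
print(every)      # ['a', 'aa', 'aaa', 'a', 'aa', 'a']
```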




Answer 2:


One trick you can use here is to match on ba(?=bab) instead. This consumes only ba, so the regex engine resumes its search just two characters further on, where the next overlapping match begins:

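import re

# Only 'ba' is consumed; the look-ahead checks for the trailing 'bab' without consuming it.
matches = re.findall(r'ba(?=bab)', "babababab")
matches = [i + 'bab' for i in matches]
print(matches)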

This prints:

['babab', 'babab', 'babab']

Note that I concatenate the tail bab to each match, which is fine, because we know the actual logical match was babab.
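
For comparison, the look-ahead wrapping from Answer 1 can be applied to the same input without concatenating the tail afterwards. This is a small sketch that also records where each match starts; the variable names are arbitrary.

```python
import re

text = 'babababab'

# Wrap the literal in a look-ahead with a capturing group so that
# nothing is consumed and every starting position is tried.
hits = [(m.start(), m.group(1)) for m in re.finditer(r'(?=(babab))', text)]

print(hits)  # [(0, 'babab'), (2, 'babab'), (4, 'babab')]
```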



Source: https://stackoverflow.com/questions/60293891/lookahead-regex-failing-to-find-the-same-overlapping-matches
