I\'m still relatively new to regex. I\'m trying to find the shortest string of text that matches a particular pattern, but am having trouble if the shortest pattern is a sub
You might be able to write the regex in such a way that it can't contain smaller matches.
For your regex:
a.*?b.*?c
I think you can write this:
a[^ab]*b[^c]*c
It's tricky to get that correct, and I don't see any more general or more obviously correct way to do it. (Edit—earlier I suggested a negative lookahead assertion, but I don't see a way to make that work.)
No, there isn't in the Python regular expression engine.
My take for a custom function, though:
import re, itertools
# directly from itertools recipes
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = itertools.tee(iterable)
for elem in b:
break
return itertools.izip(a, b)
def find_matches(rex, text):
"Find all matches, even overlapping ones"
matches= list(rex.finditer(text))
# first produce typical matches
for match in matches:
yield match.group(0)
# next, run it for any patterns included in matches
for match1, match2 in pairwise(matches):
subtext= text[match1.start()+1:match2.end()+1]
for result in find_matches(rex, subtext):
yield result
# also test the last match, if there was at least one
if matches:
subtext= text[matches[-1].start()+1:matches[-1].end()+1]
# perhaps the previous "matches[-1].end()+1" can be omitted
for result in find_matches(rex, subtext):
yield result
def shortest_match(rex, text):
"Find the shortest match"
return min(find_matches(rex, text), key=len)
if __name__ == "__main__":
pattern= re.compile('a.*?b.*?c', re.I)
searched_text= "A|B|A|B|C|D|E|F|G"
print (shortest_match(pattern, searched_text))
This might be a useful application of sexegers. Regular-expression matching is biased toward the longest, leftmost choice. Using non-greedy quantifiers such as in .*?
skirts the longest part, and reversing both the input and pattern can get around leftmost-matching semantics.
Consider the following program that outputs A|B|C
as desired:
#! /usr/bin/env python
import re
string = "A|B|A|B|C|D|E|F|G"
my_pattern = 'c.*?b.*?a'
my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
matches = my_regex.findall(string[::-1])
for match in matches:
print match[::-1]
Another way is to make a stricter pattern. Say you don't want to allow repetitions of characters already seen:
my_pattern = 'a[^a]*?b[^ab]*?c'
Your example is generic and contrived, but if we had a better idea of the inputs you're working with, we could offer better, more helpful suggestions.
Another regex solution; it finds only the last occurence of .*a.*b.*c:
my_pattern = 'a(?!.*a.*b.*c).*b[^c]*c'
a(?!.*a.*?b.*?c)
ensures that there is no 'a.*?b.*?c'
after first 'A'
strings like A|A|B|C or A|B|A|B|C or A|B|C|A|B|C in results are eliminated
b[^c]*c
ensures that after 'B' there is only one 'C'
strings like A|B|C|B|C or A|B|C|C in results are eliminated
So you have the smallest matching 'a.*?b.*?c'
I do not think that this task can be accomplished by a single regular expression. I have no proof that this is the case, but there are quite a lot of things that can't be done with regexes and I expected this problem to be one of them. Some good examples of the limitations of regexes are given in this blog post.
The regex engine starts searching from the beginning of the string till it finds a match and then exits. Thus if it finds a match before it even considers the smaller one, there is no way for you to force it to consider later matches in the same run - you will have to rerun the regex on substrings.
Setting the global flag and choosing the shortest matched string won't help as it is evident from your example - the shorter match might be a substring of another match (or partly included in it). I believe you will have to start subsequent searches from (1 + index of previous match) and go on like that.