Is there a way to remove consecutive duplicate words/phrases in a string? E.g.
[in]: foo foo bar bar foo bar
With a pattern similar to sharcashmo's, you can use subn, which returns the new string together with the number of replacements made, inside a while loop:
import re
txt = r'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not phrases .'
pattern = re.compile(r'(\b\w+(?: \w+)*)(?: \1)+\b')
repl = r'\1'
res = txt
while True:
    # subn returns a (new_string, number_of_replacements) tuple
    res, nbr = pattern.subn(repl, res)
    if nbr == 0:
        break
print(res)
When there are no more replacements, the while loop stops.
With this method you can handle all overlapping matches (which is impossible with a single pass in a replacement context) without testing the same pattern twice.
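To make the single-pass limitation concrete, here is a minimal sketch that wraps the same loop in a function and runs it on the input from the question. The names PATTERN and collapse_repeats are only illustrative and are not part of the re module.

```python
import re

# Same pattern as above: a word or phrase followed by one or more copies of itself.
PATTERN = re.compile(r'(\b\w+(?: \w+)*)(?: \1)+\b')


def collapse_repeats(text):
    """Collapse adjacent repeated words/phrases until a pass changes nothing."""
    while True:
        text, count = PATTERN.subn(r'\1', text)
        if count == 0:
            return text


sample = 'foo foo bar bar foo bar'
print(PATTERN.sub(r'\1', sample))  # single pass: foo bar foo bar
print(collapse_repeats(sample))    # repeated passes: foo bar
```

The first pass collapses "foo foo" and "bar bar", which only then exposes the repeated phrase "foo bar"; that is why a second pass is needed.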