Is there a way to remove duplicate and continuous words/phrases in a string?

广开言路 2021-01-13 11:27

Is there a way to remove consecutive duplicate words/phrases in a string? For example:

[in]: foo foo bar bar foo bar

6 answers
  •  臣服心动
    2021-01-13 12:04

    With a pattern similar to sharcashmo's, you can use subn, which returns the new string together with the number of replacements made, inside a while loop:

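    import re

    txt = r'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not phrases .'

    # Group 1 captures a word or phrase; (?: \1)+ matches its immediate repetitions.
    pattern = re.compile(r'(\b\w+(?: \w+)*)(?: \1)+\b')
    repl = r'\1'

    res = txt

    # Repeat the substitution until a pass makes no replacements.
    while True:
        # subn returns a tuple: (new_string, number_of_replacements)
        res, nbr = pattern.subn(repl, res)
        if nbr == 0:
            break

    print(res)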
    

    When there are no more replacements, the while loop stops.

    This method handles all overlapping matches (which is impossible in a single replacement pass) without testing the same pattern twice.
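
As a minimal sketch, the same loop can be wrapped in a function and applied to the input from the question. The function name dedupe_adjacent is hypothetical, chosen here for illustration; the pattern is the one from the answer above, and the sketch assumes words are separated by single spaces.

```python
import re

# Pattern from the answer above: a word or phrase (group 1) followed by
# one or more immediate repetitions of itself, separated by single spaces.
PATTERN = re.compile(r'(\b\w+(?: \w+)*)(?: \1)+\b')


def dedupe_adjacent(text):
    """Collapse consecutive repeated words/phrases (hypothetical helper)."""
    while True:
        # subn returns (new_string, number_of_replacements)
        text, count = PATTERN.subn(r'\1', text)
        if count == 0:
            return text


print(dedupe_adjacent('foo foo bar bar foo bar'))  # foo bar
```

The first pass collapses "foo foo" and "bar bar", giving "foo bar foo bar". The second pass collapses the repeated phrase "foo bar". The third pass finds nothing, so the loop ends.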
