Is there a way to remove duplicate and continuous words/phrases in a string?

广开言路 2021-01-13 11:27

Is there a way to remove duplicate and continuous words/phrases in a string? E.g.

[in]: foo foo bar bar foo bar

6 answers
  •  梦谈多话
    2021-01-13 12:01

    I love itertools. It seems like every time I want to write something, itertools already has it. In this case, groupby takes an iterable and groups runs of consecutive equal items, yielding (item_value, iterator_over_that_run) pairs. Use it here like:

    >>> from itertools import groupby
    >>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
    >>> ' '.join(item[0] for item in groupby(s.split()))
    'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
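
To make the grouping visible, here is a small sketch that prints the runs groupby finds in the input from the question. The variable names `runs` and `collapsed` are only illustrative.

```python
from itertools import groupby

words = 'foo foo bar bar foo bar'.split()

# groupby yields (key, group) pairs, where group iterates over one run
# of consecutive items equal to key.
runs = [(key, len(list(group))) for key, group in groupby(words)]
print(runs)  # [('foo', 2), ('bar', 2), ('foo', 1), ('bar', 1)]

# Keeping only the key of each run collapses repeated neighbours.
collapsed = ' '.join(key for key, _ in groupby(words))
print(collapsed)  # foo bar foo bar
```

Only adjacent repeats are merged, which is why the repeated two-word phrase "foo bar" is still there after this step.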
    

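    So let's extend that a little with a function that takes a sequence of chunks (each chunk a list of words), drops chunks that repeat the one just before them, and flattens what is left back into a single list of words: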

    from itertools import chain, groupby
    
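    def dedupe(lst):
        # lst is an iterable of chunks (lists of words), e.g. from stride() below.
        # Keep one chunk per run of equal neighbours, then flatten back to words.
        return list(chain(*[item[0] for item in groupby(lst)]))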
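
A quick sketch of what dedupe expects and returns. It uses chain.from_iterable, which behaves the same as chain(*...) here; the sample inputs are made up for illustration.

```python
from itertools import chain, groupby

def dedupe(chunks):
    # Keep the key of each run of equal consecutive chunks, then flatten.
    return list(chain.from_iterable(key for key, _ in groupby(chunks)))

# Chunks are lists of words: the repeated ['yes', 'sir'] chunk is dropped.
result = dedupe([['yes', 'sir'], ['yes', 'sir'], ['three', 'bag']])
print(result)  # ['yes', 'sir', 'three', 'bag']

# Passing bare strings instead of lists flattens them into characters,
# so single words should be wrapped as one-element lists first.
flat = dedupe(['any', 'any', 'wool'])
print(flat)  # ['a', 'n', 'y', 'w', 'o', 'o', 'l']
```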
    

    That's great for one-word phrases, but not helpful for longer phrases. What to do? Well, first, we'll want to check for longer phrases by striding over the word list, cutting it into chunks of a fixed length starting at a given offset:

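    def stride(lst, offset, length):
        # Yield the words before the offset as one leading chunk, if any.
        if offset:
            yield lst[:offset]
    
        # Then yield consecutive chunks of `length` words; the last may be shorter.
        while True:
            yield lst[offset:offset + length]
            offset += length
            if offset >= len(lst):
                return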
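
As a rough illustration of the chunks stride produces, here it is on a seven-letter list standing in for words. The sample data and the names `at_zero` and `at_one` are hypothetical.

```python
def stride(lst, offset, length):
    # Leading chunk with the items before the offset, if any.
    if offset:
        yield lst[:offset]

    # Consecutive chunks of `length` items; the last one may be shorter.
    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

letters = list('abcdefg')

at_zero = list(stride(letters, 0, 3))
print(at_zero)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]

at_one = list(stride(letters, 1, 3))
print(at_one)  # [['a'], ['b', 'c', 'd'], ['e', 'f', 'g']]
```

Shifting the offset changes where the chunk boundaries fall, which is what lets a repeated phrase line up with two neighbouring chunks.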
    

    Now we're cooking! OK. So our strategy here is to first remove all the single-word duplicates. Next, we'll remove the two-word duplicates, starting from offset 0 then 1. After that, three-word duplicates starting at offsets 0, 1, and 2, and so on up to max_phrase_length words (five in the example further down):

    def cleanse(list_of_words, max_phrase_length):
        for length in range(1, max_phrase_length + 1):
            for offset in range(length):
                list_of_words = dedupe(stride(list_of_words, offset, length))
    
        return list_of_words
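
To see the passes one at a time, the sketch below runs the same loop on the input from the question and records the word list after each (length, offset) pass. `cleanse_traced` is a hypothetical helper added only for this illustration; it is not part of the answer's code.

```python
from itertools import chain, groupby

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]
    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

def dedupe(chunks):
    return list(chain.from_iterable(key for key, _ in groupby(chunks)))

def cleanse_traced(words, max_phrase_length):
    # Same passes as cleanse, but also records the result of every pass.
    trace = []
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            words = dedupe(stride(words, offset, length))
            trace.append((length, offset, ' '.join(words)))
    return words, trace

result, trace = cleanse_traced('foo foo bar bar foo bar'.split(), 2)
for length, offset, text in trace:
    print(length, offset, text)
# 1 0 foo bar foo bar
# 2 0 foo bar
# 2 1 foo bar
```

The single-word pass leaves "foo bar foo bar", and the two-word pass at offset 0 then merges the repeated phrase.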
    

    Putting it all together:

    from itertools import chain, groupby
    
    def stride(lst, offset, length):
        if offset:
            yield lst[:offset]
    
        while True:
            yield lst[offset:offset + length]
            offset += length
            if offset >= len(lst):
                return
    
    def dedupe(lst):
        return list(chain(*[item[0] for item in groupby(lst)]))
    
    def cleanse(list_of_words, max_phrase_length):
        for length in range(1, max_phrase_length + 1):
            for offset in range(length):
                list_of_words = dedupe(stride(list_of_words, offset, length))
    
        return list_of_words
    
    a = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not phrases .'
    
    b = 'this is a sentence where phrases duplicate . sentence are not phrases .'
    
    print(' '.join(cleanse(a.split(), 5)) == b)  # True
    
