Pythonic way of removing reversed duplicates in list

后端未结

关注

 9  2404

I have a list of pairs:

[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]

and I want to remove any duplicates where

[a,b] == [


                      
              相关标签:


      
      
        
          9条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  小蘑菇        
                
              
                            
                2020-12-06 10:29
              
            
            
                                                                       
If you only want to remove reversed pairs and don't want external libraries you could use a simple generator function (loosly based on the itertools  "unique_everseen" recipe):

def remove_reversed_duplicates(iterable):
    # Create a set for already seen elements
    seen = set()
    for item in iterable:
        # Lists are mutable so we need tuples for the set-operations.
        tup = tuple(item)
        if tup not in seen:
            # If the tuple is not in the set append it in REVERSED order.
            seen.add(tup[::-1])
            # If you also want to remove normal duplicates uncomment the next line
            # seen.add(tup)
            yield item

>>> list(remove_reversed_duplicates(a))
[[0, 1], [0, 4], [1, 4]]


The generator function might be a pretty fast way to solve this problem because set-lookups are really cheap. This approach also keeps the order of your initial list and only removes reverse duplicates while being faster than most of the alternatives!



If you don't mind using an external library and you want to remove all duplicates (reversed and identical) an alternative is: iteration_utilities.unique_everseen

>>> a = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]

>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(a, key=set))
[[0, 1], [0, 4], [1, 4]]


This checks if any item has the same contents in arbitary order (thus the key=set) as another. In this case this works as expected but it also removes duplicate [a, b] instead of only [b, a] occurences. You could also use key=sorted (like the other answers suggest). The unique_everseen like this has a bad algorithmic complexity because the result of the key function is not hashable and thus the fast lookup is replaced by a slow lookup. To speed this up you need to make the keys hashable, for example by converting them to sorted tuples (like some other answers suggest):

>>> from iteration_utilities import chained
>>> list(unique_everseen(a, key=chained(sorted, tuple)))
[[0, 1], [0, 4], [1, 4]]


The chained is nothing else than a faster alternative to lambda x: tuple(sorted(x)).

EDIT: As mentioned by @jpmc26 one could use frozenset instead of normal sets:

>>> list(unique_everseen(a, key=frozenset))
[[0, 1], [0, 4], [1, 4]]




To get an idea about the performance I did some timeit comparisons for the different suggestions:

>>> a = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]

>>> %timeit list(remove_reversed_duplicates(a))
100000 loops, best of 3: 16.1 µs per loop
>>> %timeit list(unique_everseen(a, key=frozenset))
100000 loops, best of 3: 13.6 µs per loop
>>> %timeit list(set(map(frozenset, a)))
100000 loops, best of 3: 7.23 µs per loop

>>> %timeit list(unique_everseen(a, key=set))
10000 loops, best of 3: 26.4 µs per loop
>>> %timeit list(unique_everseen(a, key=chained(sorted, tuple)))
10000 loops, best of 3: 25.8 µs per loop
>>> %timeit [list(tpl) for tpl in list(set([tuple(sorted(pair)) for pair in a]))]
10000 loops, best of 3: 29.8 µs per loop
>>> %timeit set(tuple(item) for item in map(sorted, a))
10000 loops, best of 3: 28.5 µs per loop


Long list with many duplicates:

>>> import random
>>> a = [[random.randint(0, 10), random.randint(0,10)] for _ in range(10000)]

>>> %timeit list(remove_reversed_duplicates(a))
100 loops, best of 3: 12.5 ms per loop
>>> %timeit list(unique_everseen(a, key=frozenset))
100 loops, best of 3: 10 ms per loop
>>> %timeit set(map(frozenset, a))
100 loops, best of 3: 10.4 ms per loop

>>> %timeit list(unique_everseen(a, key=set))
10 loops, best of 3: 47.7 ms per loop
>>> %timeit list(unique_everseen(a, key=chained(sorted, tuple)))
10 loops, best of 3: 22.4 ms per loop
>>> %timeit [list(tpl) for tpl in list(set([tuple(sorted(pair)) for pair in a]))]
10 loops, best of 3: 24 ms per loop
>>> %timeit set(tuple(item) for item in map(sorted, a))
10 loops, best of 3: 35 ms per loop


And with fewer duplicates:

>>> a = [[random.randint(0, 100), random.randint(0,100)] for _ in range(10000)]

>>> %timeit list(remove_reversed_duplicates(a))
100 loops, best of 3: 15.4 ms per loop
>>> %timeit list(unique_everseen(a, key=frozenset))
100 loops, best of 3: 13.1 ms per loop
>>> %timeit set(map(frozenset, a))
100 loops, best of 3: 11.8 ms per loop


>>> %timeit list(unique_everseen(a, key=set))
1 loop, best of 3: 1.96 s per loop
>>> %timeit list(unique_everseen(a, key=chained(sorted, tuple)))
10 loops, best of 3: 24.2 ms per loop
>>> %timeit [list(tpl) for tpl in list(set([tuple(sorted(pair)) for pair in a]))]
10 loops, best of 3: 31.1 ms per loop
>>> %timeit set(tuple(item) for item in map(sorted, a))
10 loops, best of 3: 36.7 ms per loop


So the variants with remove_reversed_duplicates, unique_everseen(key=frozenset) and set(map(frozenset, a)) seem to be by far the fastest solutions. Which one depends on the length of the input and the number of duplicates.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  梦谈多话        
                
              
                            
                2020-12-06 10:30
              
            
            
                                                                       
If you need to preserve the order of the elements in the list then, you can use a the sorted function and set comprehension with map like this:

lst = [0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]
data = {tuple(item) for item in map(sorted, lst)}
# {(0, 1), (0, 4), (1, 4)}


or simply without map like this:

data = {tuple(sorted(item)) for item in lst}


Another way is to use a frozenset as shown here however note that this only work if you have distinct elements in your list. Because like set,  frozenset always contains unique values. So you will end up with unique value in your sublist(lose data) which may not be what you want.

To output a list, you can always use list(map(list, result)) where result is a set of tuple only in Python-3.0 or newer.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  無奈伤痛        
                
              
                            
                2020-12-06 10:30
              
            
            
                                                                       
You could use the builtin filter function.
from __future__ import print_function

def my_filter(l):
    seen = set()

    def not_seen(it):
        s = min(*it), max(*it)
        if s in seen:
            return False
        else:
            seen.add(s)
            return True
        
    out = filter(not_seen, l)

    return out

myList = [[0, 1], [0, 4], [1, 0], [1, 4], [4, 0], [4, 1]]
print(my_filter(myList)) # [[0, 1], [0, 4], [1, 4]]

As a complement I would orient you to the Python itertools module which describes a unique_everseen function which does basically the same thing as above but in a lazy, generator-based, memory-efficient version. Might be better than any of our solutions if you are working on large arrays. Here is how to use it:
from itertools import ifilterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in ifilterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

gen = unique_everseen(myList, lambda x: (min(x), max(x))) # gen is an iterator
print(gen) # <generator object unique_everseen at 0x7f82af492fa0>
result = list(gen) # consume generator into a list.
print(result) # [[0, 1], [0, 4], [1, 4]]

I haven't done any metrics to see who's fastest. However memory-efficiency and O complexity seem better in this version.
Timing min/max vs sorted
The builtin sorted function could be passed to unique_everseen to order items in the inner vectors. Instead, I pass lambda x: (min(x), max(x)). Since I know the vector size which is exactly 2, I can proceed like this.
To use sorted I would need to pass lambda x: tuple(sorted(x)) which adds overhead. Not dramatically, but still.
myList = [[random.randint(0, 10), random.randint(0,10)] for _ in range(10000)]
timeit.timeit("list(unique_everseen(myList, lambda x: (min(x), max(x))))", globals=globals(), number=20000)
>>> 156.81979029000013
timeit.timeit("list(unique_everseen(myList, lambda x: tuple(sorted(x))))", globals=globals(), number=20000)
>>> 168.8286430349999

Timings done in Python 3, which adds the globals kwarg to timeit.timeit.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复