Is it faster to union sets or check the whole list for a duplicate?

久未见 提交于 2019-12-20 15:14:16

问题


Sorry for the poorly worded title but I asked a question earlier about getting a unique list of items from two lists. People told me to make the list -> sets and then union.

So now I'm wondering if it's faster to:

  1. While adding one item to a list, scan the whole list for duplicates.
  2. Make that one item a set and then union sets.

I should probably just read up on sets in hindsight...

In Python, by the way - sorry for not clarifying.


回答1:


as you can see extending one list by another end then remove duplicates by making set is the fastest way(at least in python;))

>>> def foo():
...     """
...     extending one list by another end then remove duplicates by making set
...     """
...     l1 = range(200)
...     l2 = range(150, 250)
...     l1.extend(l2)
...     set(l1)
... 
>>> def bar():
...     """
...     checking if element is on one list end adding it only if not
...     """
...     l1 = range(200)
...     l2 = range(150, 250)
...     for elem in l2:
...             if elem not in l1:
...                     l1.append(elem)
... 
>>> def baz():
...     """
...     making sets from both lists and then union from them
...     """
...     l1 = range(200)
...     l2 = range(150, 250)
...     set(l1) | set(l2)
... 
>>> from timeit import Timer
>>> Timer(foo).timeit(10000)
0.265153169631958
>>> Timer(bar).timeit(10000)
7.921358108520508
>>> Timer(baz).timeit(10000)
0.3845551013946533
>>> 



回答2:


I really like the approach virhilo did, but it's a pretty specific set of data he was testing. In all this don't just test the functions, but test them how you'll be doing it. I put together a much more exhaustive test set. It runs each function you specify (with just a little decorator) through a list of comparisons, and figures out how long each function takes and therefore how much slower it is. The result is that it's not always clear which function you should be doing without knowing more about the size, overlap and type of your data.

Here's my test program, below will be the output.

from timeit import Timer
from copy import copy
import random
import sys

funcs = []

class timeMe(object):
    def __init__(self, f):
        funcs.append(f)
        self.f = f

    def __call__(self, *args, **kwargs):
        return self.f(*args, **kwargs)

@timeMe
def extend_list_then_set(input1, input2):
    """
    extending one list by another end then remove duplicates by making set
    """
    l1 = copy(input1)
    l2 = copy(input2)
    l1.extend(l2)
    set(l1)

@timeMe
def per_element_append_to_list(input1, input2):
    """
    checking if element is on one list end adding it only if not
    """
    l1 = copy(input1)
    l2 = copy(input2)
    for elem in l2:
            if elem not in l1:
                    l1.append(elem)

@timeMe
def union_sets(input1, input2):
    """
    making sets from both lists and then union from them
    """
    l1 = copy(input1)
    l2 = copy(input2)
    set(l1) | set(l2)

@timeMe
def set_from_one_add_from_two(input1, input2):
    """
    make set from list 1, then add elements for set 2
    """
    l1 = copy(input1)
    l2 = copy(input2)
    l1 = set(l1)
    for element in l2:
        l1.add(element)

@timeMe
def set_from_one_union_two(input1, input2):
    """
    make set from list 1, then union list 2
    """
    l1 = copy(input1)
    l2 = copy(input2)
    x = set(l1).union(l2)

@timeMe
def chain_then_set(input1, input2):
    """
    chain l1 & l2, then make a set out of that
    """
    l1 = copy(input1)
    l2 = copy(input2)
    set(itertools.chain(l1, l2))

def run_results(l1, l2, times):
    for f in funcs:
        t = Timer('%s(l1, l2)' % f.__name__,
            'from __main__ import %s; l1 = %s; l2 = %s' % (f.__name__, l1, l2))
        yield (f.__name__, t.timeit(times))    

test_datasets = [
    ('original, small, some overlap', range(200), range(150, 250), 10000),
    ('no overlap: l1 = [1], l2 = [2..100]', [1], range(2, 100), 10000),
    ('lots of overlap: l1 = [1], l2 = [1]*100', [1], [1]*100, 10000),
    ('50 random ints below 2000 in each', [random.randint(0, 2000) for x in range(50)], [random.randint(0, 2000) for x in range(50)], 10000),
    ('50 elements in each, no overlap', range(50), range(51, 100), 10000),
    ('50 elements in each, total overlap', range(50), range(50), 10000),
    ('500 random ints below 500 in each', [random.randint(0, 500) for x in range(500)], [random.randint(0, 500) for x in range(500)], 1000),
    ('500 random ints below 2000 in each', [random.randint(0, 2000) for x in range(500)], [random.randint(0, 2000) for x in range(500)], 1000),
    ('500 random ints below 200000 in each', [random.randint(0, 200000) for x in range(500)], [random.randint(0, 200000) for x in range(500)], 1000),
    ('500 elements in each, no overlap', range(500), range(501, 1000), 10000),
    ('500 elements in each, total overlap', range(500), range(500), 10000),
    ('10000 random ints below 200000 in each', [random.randint(0, 200000) for x in range(10000)], [random.randint(0, 200000) for x in range(10000)], 50),
    ('10000 elements in each, no overlap', range(10000), range(10001, 20000), 10),
    ('10000 elements in each, total overlap', range(10000), range(10000), 10),
    ('original lists 100 times', range(200)*100, range(150, 250)*100, 10),
]

fullresults = []
for description, l1, l2, times in test_datasets:
    print "Now running %s times: %s" % (times, description)
    results = list(run_results(l1, l2, times))
    speedresults = [x for x in sorted(results, key=lambda x: x[1])]
    for name, speed in results:
        finish = speedresults.index((name, speed)) + 1
        timesslower = speed / speedresults[0][1]
        fullresults.append((description, name, speed, finish, timesslower))
        print '\t', finish, ('%.2fx' % timesslower).ljust(10), name.ljust(40), speed

print
import csv
out = csv.writer(sys.stdout)
out.writerow(('Test', 'Function', 'Speed', 'Place', 'timesslower'))
out.writerows(fullresults)

The results

My point here is to encourage you to test with your data, so I don't want to harp on specifics. However... The first extend method is the fastest average method, but set_from_one_union_two (x = set(l1).union(l2)) wins a few of times. You can get more details if you run the script yourself.

The numbers I'm reporting are the number of times slower this function is than the fatest function on that test. If it was the fastest, it will be 1.

                                            Functions                                                                                                                           
Tests                                       extend_list_then_set     per_element_append_to_list    set_from_one_add_from_two  set_from_one_union_two     union_sets      chain_then_set
original, small, some overlap               1                          25.04                        1.53                        1.18                       1.39           1.08
no overlap: l1 = [1], l2 = [2..100]         1.08                       13.31                        2.10                        1                          1.27           1.07
lots of overlap: l1 = [1], l2 = [1]*100     1.10                        1.30                        2.43                        1                          1.25           1.05
50 random ints below 2000 in each           1                           7.76                        1.35                        1.20                       1.31           1   
50 elements in each, no overlap             1                           9.00                        1.48                        1.13                       1.18           1.10
50 elements in each, total overlap          1.08                        4.07                        1.64                        1.04                       1.41           1   
500 random ints below 500 in each           1.16                       68.24                        1.75                        1                          1.28           1.03
500 random ints below 2000 in each          1                         102.42                        1.64                        1.43                       1.81           1.20
500 random ints below 200000 in each        1.14                      118.96                        1.99                        1.52                       1.98           1   
500 elements in each, no overlap            1.01                      145.84                        1.86                        1.25                       1.53           1   
500 elements in each, total overlap         1                          53.10                        1.95                        1.16                       1.57           1.05          
10000 random ints below 200000 in each      1                        2588.99                        1.73                        1.35                       1.88           1.12
10000 elements in each, no overlap          1                        3164.01                        1.91                        1.26                       1.65           1.02
10000 elements in each, total overlap       1                        1068.67                        1.89                        1.26                       1.70           1.05
original lists 100 times                    1.11                     2068.06                        2.03                        1                          1.04           1.17

                                 Average    1.04                      629.25                       1.82                         1.19                       1.48           1.06
                      Standard Deviation    0.05                     1040.76                       0.26                         0.15                       0.26           0.05
                                     Max    1.16                     3164.01                       2.43                         1.52                       1.98           1.20



回答3:


All depends of what you have as input and want as output.

If you have a list li at the beginning and want to get a modified list in the end, then the faster method is if not elt in li: li.append(elt) the problem is converting initial list to set, then converting back to list which is way too slow.

But if you can work with a set s at all time (you don't care about the order of the list, and methods receiving it just need some iterable), just doing s.add(elt) is faster.

If in the beginning you have to lists and want a list in the end, even with final conversion from list set to list, it is faster to manage unicity of items using sets, but you can easily check looking at the exemple provided by @virhilo in it's answer, than concatenating the two lists using extend, then converting the result to set is faster than converting the two lists to sets and performing an union.

I don't know exactly what are the constraints of your programs, but if unicity is as important as it seems, and if keeping insertion order is not necessary, you would be well advised to use sets at all time, never changing them to lists. Most algorithms will work for both anyway, thanks to Duck Typing as they both are different kinds of iterables.




回答4:


The fastest thing you can do is build two sets from the lists and take the union of them. Both set construction from list and set union are implemented in the runtime, in very optimized C, so it is very fast.

In code, if the lists are l1 and l2, you can do

unique_elems = set(l1) | set(l2)

EDIT: as @kriss notes, extending l1 with l2 is faster. This code however doesn't change l1, and works also if l1 and l2 are generic iterables.



来源:https://stackoverflow.com/questions/4674208/is-it-faster-to-union-sets-or-check-the-whole-list-for-a-duplicate

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!