Sort a list by presence of items in another list

Question

Suppose I have two lists:

a = ['30', '10', '90', '1111', '17']
b = ['60', '1201', '30', '17', '900']

How would you sort this most efficiently, such that:

List b is sorted with respect to a. Elements that appear only in b should be placed at the end of the sorted list. Elements that appear only in a can be ignored.

Example output:

c = ['30', '17', '60', '1201', '900']

Sorry, it's a simple question. My attempt is stuck at the point of taking the intersection.

intersection = sorted(set(a) & set(b), key = a.index)

Answer 1:

There is no need to actually sort here. You want the elements in a which are in b, in the same order as they were in a; followed by the elements in b which are not in a, in the same order as they were in b.

We can just do this with two filters, using the sets for fast membership tests:

>>> a = ['30', '10', '90', '1111', '17']
>>> b = ['60', '1201', '30', '17', '900']
>>> a_set = set(a)
>>> b_set = set(b)
>>> [*filter(lambda x: x in b_set, a), *filter(lambda x: x not in a_set, b)]
['30', '17', '60', '1201', '900']

Or if you prefer comprehensions:

>>> [*(x for x in a if x in b_set), *(x for x in b if x not in a_set)]
['30', '17', '60', '1201', '900']

Both take linear time, O(len(a) + len(b)) on average, which is better than the O(n log n) of sorting.

Answer 2:

You can create a custom dictionary, with the keys being the entries in a and the values their position. Then sort b according to the values in the dictionary. You can use dict.get for the lookup and inf if the value is not present:

a = ['30', '10', '90', '1111', '17']
b = ['60', '1201', '30', '17', '900']

d = {i:ix for ix, i in enumerate(a)}
#{'30': 0, '10': 1, '90': 2, '1111': 3, '17': 4}
sorted(b, key=lambda x: d.get(x, float('inf')))
#['30', '17', '60', '1201', '900']
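One detail worth spelling out: sorted is stable, so every element of b that is missing from a gets the same key (infinity) and keeps its original relative order. A minimal sketch of that behavior follows; the function name sort_by_reference and the reordered b are made up for this illustration and are not part of the answer.

```python
def sort_by_reference(b, a):
    # Hypothetical helper: map each element of a to its position in a.
    position = {item: index for index, item in enumerate(a)}
    missing = float('inf')  # shared key for elements of b that are absent from a
    # Ties on the key keep their original order because sorted() is stable.
    return sorted(b, key=lambda x: position.get(x, missing))

a = ['30', '10', '90', '1111', '17']
b = ['900', '17', '60', '30', '1201']  # illustrative input, reordered from the question
result = sort_by_reference(b, a)
print(result)
```

Here '30' and '17' come first in the order they have in a, and '900', '60', '1201' follow in the order they have in b.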

Answer 3:

Since you hint at using sets, I assume that neither list contains duplicates. In that case you can simply use list comprehensions:

c = [x for x in a if x in b] + [x for x in b if x not in a]

This is O(n^2), however. If your lists are large and you want to make it faster, build a set from each of a and b and use those for the membership checks.
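A minimal sketch of that faster variant, assuming the elements are hashable; the function name order_by_presence is made up for this example:

```python
def order_by_presence(a, b):
    # Build each set once so that every membership check is O(1) on average.
    a_set = set(a)
    b_set = set(b)
    # Elements of a that also occur in b (in a's order),
    # followed by elements of b that do not occur in a (in b's order).
    return [x for x in a if x in b_set] + [x for x in b if x not in a_set]

a = ['30', '10', '90', '1111', '17']
b = ['60', '1201', '30', '17', '900']
c = order_by_presence(a, b)
print(c)
```

The comprehensions still iterate over the original lists, so the output order is unchanged; only the lookups go through the sets.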

Answer 4:

Your title is actually clearer than your description and can be pretty directly translated to code:

Sort a list by presence of items in another list

Code:

>>> sorted(b, key=set(a).__contains__, reverse=True)
['30', '17', '60', '1201', '900']

>>> sorted(b, key=lambda x, s=set(a): x not in s)
['30', '17', '60', '1201', '900']

Sorting booleans is practically indistinguishable from linear time, and these solutions are faster than the accepted solution, both on your example data and on example data I tried with millions of random numbers (where about half of b's elements were in a).

Benchmarks

   n    b in a   kaya1    kaya2    heap1    heap2    heap3
----------------------------------------------------------
   1024 53.12%  0.00046  0.00033  0.00020  0.00067  0.00018
   2048 51.03%  0.00142  0.00069  0.00048  0.00071  0.00060
   4096 50.34%  0.00226  0.00232  0.00127  0.00183  0.00125
   8192 50.42%  0.00938  0.00843  0.00328  0.00471  0.00351
  16384 50.38%  0.02010  0.01647  0.00776  0.00992  0.00839
  32768 49.96%  0.03987  0.03165  0.01661  0.02326  0.01951
  65536 50.20%  0.08002  0.06548  0.03326  0.04828  0.03896
 131072 50.04%  0.16118  0.12863  0.06671  0.09642  0.07840
 262144 50.06%  0.32698  0.26757  0.13477  0.19342  0.15828
 524288 50.08%  0.66735  0.54627  0.27378  0.38365  0.32496
1048576 50.00%  1.34095  1.08972  0.54703  0.78028  0.65623
2097152 50.03%  2.68957  2.20556  1.13797  1.60649  1.33975
4194304 50.01%  5.36141  4.33496  2.25494  3.18520  2.70506
8388608 49.99% 10.72588  8.74114  4.56061  6.35421  5.36515

Note:

n is the size of b.
a is prepared as a set before benchmarking the functions, in order to focus on their differences. The size of a is always 8388608 so that the cost of each "in a" check stays the same across rows (even sets get slower when they get larger).
b in a is the percentage of elements of b in a. I made them so that this is about 50%.
kaya1 and kaya2 are from the accepted answer by @kaya3, modified so that they do what I think is the task (sort b by presence of items in a, not "a & b" followed by "b \ a").
heap1 and heap2 are my above two solutions using sorted.
heap3 is the fastest solution without sorted that I was able to write.
The results are times in seconds.
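To make the two readings of the task mentioned in the notes concrete, here is a small sketch on made-up input where they disagree. The function names are hypothetical; the first follows the "a & b" followed by "b \ a" reading, the second sorts b by presence in a.

```python
def a_order_then_rest(a, b):
    # Common elements in a's order, then the rest of b in b's order.
    a_set, b_set = set(a), set(b)
    return [x for x in a if x in b_set] + [x for x in b if x not in a_set]

def b_sorted_by_presence(a, b):
    # Stable sort of b: elements present in a first, everything keeps b's order.
    a_set = set(a)
    return sorted(b, key=lambda x: x not in a_set)

# Illustrative input: '17' precedes '30' in a, but follows it in b.
a = ['17', '30', '10']
b = ['60', '30', '17', '900']

by_a_order = a_order_then_rest(a, b)
by_presence = b_sorted_by_presence(a, b)
print(by_a_order)
print(by_presence)
```

On the question's own example both readings give the same list, because the common elements '30' and '17' appear in the same relative order in a and in b.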

Benchmark code:

from timeit import repeat
import random

def kaya1(a_set, b):
    return [*filter(lambda x: x in a_set, b), *filter(lambda x: x not in a_set, b)]

def kaya2(a_set, b):
    return [*(x for x in b if x in a_set), *(x for x in b if x not in a_set)]

def heap1(a_set, b):
    return sorted(b, key=a_set.__contains__, reverse=True)

def heap2(a_set, b):
    return sorted(b, key=lambda x: x not in a_set)

def heap3(a_set, b):
    not_in_a = []
    append = not_in_a.append
    in_a = [x for x in b if x in a_set or append(x)]
    in_a.extend(not_in_a)
    return in_a

print('   n    b in a   kaya1    kaya2    heap1    heap2    heap3')
print('----------------------------------------------------------')

A = random.sample(range(2**24), 2**23)
B = random.sample(range(2**24), 2**23)
a_set = set(A)

for e in range(10, 24):
    n = 2**e
    b = B[:n]
    print('%7d %5.2f%%' % (n, 100 * len(set(b) & a_set) / len(b)), end='')
    expect = None
    for sort in kaya1, kaya2, heap1, heap2, heap3:
        t = min(repeat(lambda: sort(a_set, b), number=1))
        print('%9.5f' % t, end='')
        output = sort(a_set, b)
        if expect is None:
            expect = output
        else:
            assert output == expect
    print()

Answer 5:

This should work:

intersection = sorted(set(a) & set(b), key=a.index)
intersection.extend([ele for ele in b if ele not in intersection])

Source: https://stackoverflow.com/questions/60211248/sort-a-list-by-presence-of-items-in-another-list

Tags

python

list

sorting