Question
Suppose I have two lists:
a = ['30', '10', '90', '1111', '17']
b = ['60', '1201', '30', '17', '900']
How would you sort this most efficiently, such that list b is sorted with respect to a? Unique elements in b should be placed at the end of the sorted list. Unique elements in a can be ignored.
Example output:
c = ['30', '17', '60', '1201', '900']
Sorry, it's a simple question. My attempt is stuck at the point of taking the intersection.
intersection = sorted(set(a) & set(b), key = a.index)
Answer 1:
There is no need to actually sort here. You want the elements in a which are in b, in the same order as they were in a; followed by the elements in b which are not in a, in the same order as they were in b.
We can just do this with two filters, using the sets for fast membership tests:
>>> a = ['30', '10', '90', '1111', '17']
>>> b = ['60', '1201', '30', '17', '900']
>>> a_set = set(a)
>>> b_set = set(b)
>>> [*filter(lambda x: x in b_set, a), *filter(lambda x: x not in a_set, b)]
['30', '17', '60', '1201', '900']
Or if you prefer comprehensions:
>>> [*(x for x in a if x in b_set), *(x for x in b if x not in a_set)]
['30', '17', '60', '1201', '900']
Both take linear time, O(len(a) + len(b)) on average including building the sets, which is better than the O(n log n) of a general sort.
Answer 2:
You can create a custom dictionary, with the keys being the entries in a and the values their positions. Then sort b according to the values in the dictionary. You can use dict.get for the lookup, with float('inf') as the default if the value is not present:
a = ['30', '10', '90', '1111', '17']
b = ['60', '1201', '30', '17', '900']
d = {i:ix for ix, i in enumerate(a)}
#{'30': 0, '10': 1, '90': 2, '1111': 3, '17': 4}
sorted(b, key=lambda x: d.get(x, float('inf')))
#['30', '17', '60', '1201', '900']
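Because sorted is stable, the items of b that are missing from a all share the key float('inf') and so keep their relative order from b. A minimal sketch of that behaviour, wrapped in a hypothetical helper named sort_by_position (the name is not from the answer) and run on a b that contains a repeated item:

```python
def sort_by_position(a, b):
    """Sort b by each item's index in a; items absent from a go last."""
    # Assumes the items are hashable. If a has duplicates, the last index wins.
    position = {item: index for index, item in enumerate(a)}
    return sorted(b, key=lambda item: position.get(item, float('inf')))

a = ['30', '10', '90', '1111', '17']
b = ['60', '17', '1201', '30', '17', '900']  # '17' appears twice
result = sort_by_position(a, b)
print(result)  # ['30', '17', '17', '60', '1201', '900']
```

Repeated items in b are kept, and the three items not in a stay in the order they had in b.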
Answer 3:
As you gave the hint of using set, it seems to me that the two lists contain non-duplicated items. Then you can simply do list comprehension:
c = [x for x in a if x in b] + [x for x in b if x not in a]
This is O(n^2), however. If your lists are large and you want to make it faster, build a set from each of a and b and use those for the membership checks.
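The set-based variant suggested above could look roughly like this. It is a sketch that assumes the items are hashable; the function name order_by_other is made up for illustration.

```python
def order_by_other(a, b):
    """Items of a that occur in b (in a's order), then items of b not in a (in b's order)."""
    # Building the sets once makes each membership test O(1) on average.
    a_set, b_set = set(a), set(b)
    return [x for x in a if x in b_set] + [x for x in b if x not in a_set]

a = ['30', '10', '90', '1111', '17']
b = ['60', '1201', '30', '17', '900']
c = order_by_other(a, b)
print(c)  # ['30', '17', '60', '1201', '900']
```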
Answer 4:
Your title is actually clearer than your description and can be pretty directly translated to code:
Sort a list by presence of items in another list
Code:
>>> sorted(b, key=set(a).__contains__, reverse=True)
['30', '17', '60', '1201', '900']
or
>>> sorted(b, key=lambda x, s=set(a): x not in s)
['30', '17', '60', '1201', '900']
Sorting booleans is practically indistinguishable from linear time, and these solutions are faster than the accepted solution both on your example data and on example data I tried with millions of random numbers (where about half of b's elements were in a).
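One thing to keep in mind is that a stable sort on a boolean key leaves the shared items in b's order, whereas the two-filter approach puts them in a's order. The question's example happens to give the same result either way. A small sketch with made-up input where the two differ:

```python
a = ['17', '30']
b = ['30', '60', '17']
a_set, b_set = set(a), set(b)

# Stable sort on a boolean key: shared items stay in b's order.
by_presence = sorted(b, key=lambda x: x not in a_set)

# Two filters: shared items follow a's order, then the rest of b.
by_a_order = [x for x in a if x in b_set] + [x for x in b if x not in a_set]

print(by_presence)  # ['30', '17', '60']
print(by_a_order)   # ['17', '30', '60']
```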
Benchmarks
      n b in a    kaya1    kaya2    heap1    heap2    heap3
----------------------------------------------------------
   1024 53.12%  0.00046  0.00033  0.00020  0.00067  0.00018
   2048 51.03%  0.00142  0.00069  0.00048  0.00071  0.00060
   4096 50.34%  0.00226  0.00232  0.00127  0.00183  0.00125
   8192 50.42%  0.00938  0.00843  0.00328  0.00471  0.00351
  16384 50.38%  0.02010  0.01647  0.00776  0.00992  0.00839
  32768 49.96%  0.03987  0.03165  0.01661  0.02326  0.01951
  65536 50.20%  0.08002  0.06548  0.03326  0.04828  0.03896
 131072 50.04%  0.16118  0.12863  0.06671  0.09642  0.07840
 262144 50.06%  0.32698  0.26757  0.13477  0.19342  0.15828
 524288 50.08%  0.66735  0.54627  0.27378  0.38365  0.32496
1048576 50.00%  1.34095  1.08972  0.54703  0.78028  0.65623
2097152 50.03%  2.68957  2.20556  1.13797  1.60649  1.33975
4194304 50.01%  5.36141  4.33496  2.25494  3.18520  2.70506
8388608 49.99% 10.72588  8.74114  4.56061  6.35421  5.36515
Note:
- n is the size of b.
- a is prepared as a set before benchmarking the functions, in order to focus on their differences. The size of a is always 8388608 in order to keep in a checks constant time (even sets get slower when they get larger).
- b in a is the percentage of elements of b in a. I made them so that this is about 50%.
- kaya1 and kaya2 are from the accepted answer by @kaya3, modified so that they do what I think is the task (sort b by presence of items in a, not "a & b" followed by "b \ a").
- heap1 and heap2 are my above two solutions using sorted.
- heap3 is the fastest solution without sorted that I was able to write.
- The results are times in seconds.
Benchmark code:
from timeit import repeat
import random

def kaya1(a_set, b):
    return [*filter(lambda x: x in a_set, b), *filter(lambda x: x not in a_set, b)]

def kaya2(a_set, b):
    return [*(x for x in b if x in a_set), *(x for x in b if x not in a_set)]

def heap1(a_set, b):
    return sorted(b, key=a_set.__contains__, reverse=True)

def heap2(a_set, b):
    return sorted(b, key=lambda x: x not in a_set)

def heap3(a_set, b):
    not_in_a = []
    append = not_in_a.append
    in_a = [x for x in b if x in a_set or append(x)]
    in_a.extend(not_in_a)
    return in_a

print('      n b in a    kaya1    kaya2    heap1    heap2    heap3')
print('----------------------------------------------------------')

A = random.sample(range(2**24), 2**23)
B = random.sample(range(2**24), 2**23)
a_set = set(A)

for e in range(10, 24):
    n = 2**e
    b = B[:n]
    print('%7d %5.2f%%' % (n, 100 * len(set(b) & a_set) / len(b)), end='')
    expect = None
    for sort in kaya1, kaya2, heap1, heap2, heap3:
        t = min(repeat(lambda: sort(a_set, b), number=1))
        print('%9.5f' % t, end='')
        output = sort(a_set, b)
        if expect is None:
            expect = output
        else:
            assert output == expect
    print()
Answer 5:
Maybe this should work.
intersection = sorted(set(a) & set(b), key=a.index)
intersection.extend([ele for ele in b if ele not in intersection])
Source: https://stackoverflow.com/questions/60211248/sort-a-list-by-presence-of-items-in-another-list