Keep all elements in one list from another

Question

I have two large lists, train and keep, the latter containing only unique elements, e.g.

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

Is there a way to create a new list that has all the elements of train that are in keep using sets? The end result should be:

train_keep = [1, 3, 4, 3, 1]

Currently I'm using itertools.filterfalse from the question "How to keep elements of a list based on another list", but it is very slow because the lists are large.

Answer 1:

Convert the list keep into a set, since membership in it will be checked frequently. Iterate over train, since you want to keep order and repeats; that rules out turning train itself into a set. Even if it were an option, it wouldn't help, since the iteration would have to happen anyway:

keeps = set(keep)
train_keep = [k for k in train if k in keeps]

A lazier, and probably slower, version would be something like

train_keep = filter(lambda x: x in keeps, train)
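In Python 3, filter returns a lazy iterator, so it has to be wrapped in list() to get the list the question asks for. A minimal sketch with the question's data; passing the bound method keeps.__contains__ instead of a lambda is my own variation, not part of the answer:

```python
train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

keeps = set(keep)

# filter() is lazy in Python 3: nothing is evaluated until it is consumed
lazy = filter(keeps.__contains__, train)

# Materialize the result; order and repeats from train are preserved
train_keep = list(lazy)
print(train_keep)  # [1, 3, 4, 3, 1]
```

Once consumed, the filter object is empty, so keep the list if the result is needed more than once.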

Neither of these options will give you a large speedup. You'd probably be better off using numpy, pandas, or some other library that implements the loops in C and stores numbers as something simpler than full-blown Python objects. Here is a sample numpy solution:

import numpy as np

train = np.array([...])
keep = np.array([...])
train_keep = train[np.isin(train, keep)]

This is likely an O(M * N) algorithm rather than O(M) set lookup, but if checking N elements in keep is faster than a nominally O(1) lookup, you win.
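For reference, the np.isin approach filled in with the question's sample data (a small sketch; the intermediate name mask is mine):

```python
import numpy as np

train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])

# Boolean mask with one entry per element of train: True where it occurs in keep
mask = np.isin(train, keep)

# Boolean indexing keeps the original order and the repeats
train_keep = train[mask]
print(train_keep.tolist())  # [1, 3, 4, 3, 1]
```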

You can get something closer to O(M log(N)) using sorted lookup:

train = np.array([...])
keep = np.array([...])
keep.sort()

ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
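The clamp on ind is needed because searchsorted returns keep.size for any value larger than every element of keep, which would be an out-of-range index. The same logic can be sketched in pure Python with the standard-library bisect module. This is a swapped-in illustration, not the NumPy code from the answer, and the function name keep_sorted_lookup is hypothetical:

```python
from bisect import bisect_left


def keep_sorted_lookup(train, keep):
    """Return the elements of train that occur in keep, via binary search."""
    sorted_keep = sorted(keep)
    n = len(sorted_keep)
    result = []
    for value in train:
        # Leftmost insertion point; equals n when value exceeds all of keep
        i = bisect_left(sorted_keep, value)
        # Guard the out-of-range index before comparing
        if i < n and sorted_keep[i] == value:
            result.append(value)
    return result


print(keep_sorted_lookup([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1], [1, 3, 4]))
# [1, 3, 4, 3, 1]
```

Each lookup costs O(log N), but the loop runs in the interpreter, so this version is only meant to show the logic, not to be fast.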

A better alternative might be to append np.inf or an out-of-bounds maximum integer to the sorted keep array, so you don't have to distinguish missing elements from edge elements with the extra clamping step at all. Something like max(train.max() + 1, keep.max()) would do (the built-in max, since np.max would interpret its second argument as an axis):

train = np.array([...])
keep = np.array([... 99999])
keep.sort()

ind = np.searchsorted(keep, train, side='left')
train_keep = train[keep[ind] == train]
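Putting the sentinel idea together as a runnable sketch. The function name keep_with_sentinel is hypothetical, and the code assumes non-empty integer arrays whose maximum leaves room for adding 1 without overflow:

```python
import numpy as np


def keep_with_sentinel(train, keep):
    """Filter train by membership in keep using a sentinel-padded sorted array."""
    train = np.asarray(train)
    keep = np.asarray(keep)
    # Sentinel strictly greater than every element of train (assumption:
    # integer data, non-empty arrays, no overflow when adding 1)
    sentinel = max(train.max() + 1, keep.max())
    padded = np.sort(np.append(keep, sentinel))
    # With the sentinel in place, every index is < padded.size
    ind = np.searchsorted(padded, train, side='left')
    return train[padded[ind] == train]


result = keep_with_sentinel([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1], [1, 3, 4])
print(result.tolist())  # [1, 3, 4, 3, 1]
```

Because the sentinel is larger than anything in train, it can never match an element of train, so it does not add false positives.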

For random inputs with train.size = 10000 and keep.size = 10, the numpy method is ~10x faster on my laptop.

Answer 2:

>>> keep_set = set(keep)
>>> [val for val in train if val in keep_set]
[1, 3, 4, 3, 1]

Note that if keep is small, there might not be any performance advantage to converting it to a set (benchmark to make sure).
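One way to run that benchmark is with the standard-library timeit module. This is only a sketch: the sizes, the value range, and the helper names with_list and with_set are arbitrary choices of mine, and the timings will differ from machine to machine:

```python
import random
import timeit

random.seed(0)
train = [random.randrange(100) for _ in range(10_000)]
keep = random.sample(range(100), 5)  # a deliberately small keep list
keep_set = set(keep)


def with_list():
    # Membership test scans the small list linearly
    return [val for val in train if val in keep]


def with_set():
    # Membership test is a hash lookup
    return [val for val in train if val in keep_set]


t_list = timeit.timeit(with_list, number=50)
t_set = timeit.timeit(with_set, number=50)
print(f"list lookup: {t_list:.4f} s, set lookup: {t_set:.4f} s")
```

Try a few sizes of keep that match your real data before deciding whether the conversion is worth it.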

Answer 3:

This is an option:

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

keep_set = set(keep)
res = [item for item in train if item in keep_set]
# [1, 3, 4, 3, 1]

I use keep_set to speed up the lookup a bit.

Answer 4:

The logic is the same, but give it a try; a generator may be faster in your case:

def keep_if_in(to_keep, ary):
  for element in ary:
    if element in to_keep:
      yield element

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]
train_keep = keep_if_in(set(keep), train)

Finally, convert it to a list when required, or iterate over the generator directly:

print(list(train_keep))

#  Alternatively, uncomment the loop below and comment out the print above;
#  only one of the two can run, because a generator can be consumed only once.
#  for e in train_keep:
#    print(e)
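The single-use behaviour of the generator can be seen directly. A small sketch reusing keep_if_in from this answer; the names first_pass and second_pass are mine:

```python
def keep_if_in(to_keep, ary):
    for element in ary:
        if element in to_keep:
            yield element


train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

train_keep = keep_if_in(set(keep), train)

first_pass = list(train_keep)   # [1, 3, 4, 3, 1]
second_pass = list(train_keep)  # [] because the generator is already exhausted

print(first_pass, second_pass)
```

If the result is needed more than once, store the list or call keep_if_in again to get a fresh generator.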

Answer 5:

This is a slight expansion of Mad Physicist's clever technique to cover the case where the lists contain strings and one of them is a dataframe column. (I was trying to find a list of items in a dataframe, including all duplicates, but the obvious answer, mylist.isin(df['col']), removed the duplicates.) I adapted his answer to deal with possible truncation of string data by NumPy.

import numpy as np
import pandas as pd

# Sample dataframe with strings
d = {'train': ['ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510l','ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510d02','ABC_S8#Q09#2#510c#8y','ABC_S8#Q09#2#510a#6'], 'col2': [1,2,3,4,5,6]}
df = pd.DataFrame(data=d)

keep_list = ['ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510b13','ABC_S8#Q09#2#510c#8y']

# Make sure the NumPy datatype accommodates the longest string in either list
maxlen = max(len(max(keep_list, key=len)), len(max(df['train'], key=len)))
strtype = '<U' + str(maxlen)

#Convert lists to Numpy arrays
keep = np.array(keep_list,dtype = strtype)
train = np.array(df['train'],dtype = strtype)

#Algorithm
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = df[keep[ind] == df['train']] #reference the original dataframe

I found this to be much faster than other solutions I tried.

Source: https://stackoverflow.com/questions/57162293/keep-all-elements-in-one-list-from-another

Tags: python, list, set, match, unique