Python Compare Tokenized Lists

社会主义新天地 submitted on 2019-12-11 06:06:03

Question


I need the fastest-possible solution to this problem as it will be applied to a huge data set:

Given this master list:

m=['abc','bcd','cde','def']

...and this reference list of lists:

r=[['abc','def'],['bcd','cde'],['abc','def','bcd']]

I'd like to compare each list within r to the master list (m) and generate a new list of lists. This new object will have a 1 for each match and a 0 for each non-match, in the order of m. So every list in the new object will always have the same length as m. Here's what I would expect based on m and r above:

[[1,0,0,1],[0,1,1,0],[1,1,0,1]]

Because the first element of r is ['abc','def'] and has a match with the 1st and 4th elements of m, the result is then [1,0,0,1].

Here's my approach so far (probably way too slow, and it is missing the zeros):

output=[]
for i in r:
    output.append([1 for x in m if x in i])

resulting in:

[[1, 1], [1, 1], [1, 1, 1]]

Thanks in advance!


Answer 1:


You can use a nested list comprehension like this:

>>> m = ['abc','bcd','cde','def']
>>> r = [['abc','def'],['bcd','cde'],['abc','def','bcd']]
>>> [[1 if mx in rx else 0 for mx in m] for rx in r]
[[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]

Also, you could shorten the 1 if ... else 0 using int(...), and you can convert the sublists of r to sets, so that the individual mx in rx lookups are faster.

>>> [[int(mx in rx) for mx in m] for rx in r]
[[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
>>> [[int(mx in rx) for mx in m] for rx in map(set, r)]
[[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]

While int(...) is a bit shorter than 1 if ... else 0, it also seems to be slower, so you probably should not use it. Converting the sublists of r to sets before the repeated lookups should speed things up for longer lists, but for your very short example lists it is in fact slower than the naive approach.

>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in r]
100000 loops, best of 3: 4.74 µs per loop
>>> %timeit [[int(mx in rx) for mx in m] for rx in r]
100000 loops, best of 3: 8.07 µs per loop
>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in map(set, r)]
100000 loops, best of 3: 5.82 µs per loop

For longer lists, using set becomes faster, as would be expected:

>>> import random
>>> m = [random.randint(1, 100) for _ in range(50)]
>>> r = [[random.randint(1,100) for _ in range(10)] for _ in range(20)]
>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in r]
1000 loops, best of 3: 412 µs per loop
>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in map(set, r)]
10000 loops, best of 3: 208 µs per loop
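
Outside IPython, the same comparison can be reproduced with the standard-library timeit module. The following is a minimal sketch, not part of the original answer: the function names encode_naive and encode_with_sets are made up for illustration, and timings depend on the machine and Python version, so none are quoted here.

```python
import random
import timeit

def encode_naive(m, r):
    # Membership test scans each sublist: O(len(rx)) per lookup.
    return [[1 if mx in rx else 0 for mx in m] for rx in r]

def encode_with_sets(m, r):
    # Convert each sublist to a set once, so each lookup is O(1) on average.
    return [[1 if mx in rx else 0 for mx in m] for rx in map(set, r)]

random.seed(0)  # fixed seed so repeated runs use the same data
m = [random.randint(1, 100) for _ in range(50)]
r = [[random.randint(1, 100) for _ in range(10)] for _ in range(20)]

# Both variants must agree before their speed is compared.
assert encode_naive(m, r) == encode_with_sets(m, r)

for fn in (encode_naive, encode_with_sets):
    seconds = timeit.timeit(lambda: fn(m, r), number=1000)
    print(f"{fn.__name__}: {seconds / 1000 * 1e6:.1f} µs per call")
```

Since the question mentions a huge data set, it is probably worth running this with list sizes close to the real data before choosing a variant.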



Answer 2:


You were almost there.

You want to add 1 if x is in i and 0 if it is not, for every x in m.

So the code reads much like that sentence: 1 if x in i else 0 is the expression, evaluated for each x in m:

output = [[1 if x in i else 0 for x in m] for i in r]
print(output)

This results in:

[[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
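
The difference from the original attempt is where the if sits. A trailing if in a comprehension filters elements out, while a conditional expression placed before the for maps every element to a value. A small sketch using the first sublist from the question (the names filtered and mapped are illustrative only):

```python
m = ['abc', 'bcd', 'cde', 'def']
i = ['abc', 'def']

# Trailing `if` is a filter: non-matching items are dropped entirely.
filtered = [1 for x in m if x in i]

# Conditional expression is a mapping: every x in m yields 1 or 0.
mapped = [1 if x in i else 0 for x in m]

print(filtered)  # [1, 1]
print(mapped)    # [1, 0, 0, 1]
```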



Answer 3:


One approach using np.in1d with one loop (the NumPy documentation recommends np.isin over np.in1d for new code) -

import numpy as np
np.array([np.in1d(m,i) for i in r]).astype(int)

With an explicit loop, it would look something like this -

out = np.empty((len(r),len(m)),dtype=int)
for i,item in enumerate(r):
    out[i] = np.in1d(m,item)

We can use dtype=bool instead of int to save memory and to skip the conversion step.

Sample run -

In [18]: m
Out[18]: ['abc', 'bcd', 'cde', 'def']

In [19]: r
Out[19]: [['abc', 'def'], ['bcd', 'cde'], ['abc', 'def', 'bcd']]

In [20]: np.array([np.in1d(m,i) for i in r]).astype(int)
Out[20]: 
array([[1, 0, 0, 1],
       [0, 1, 1, 0],
       [1, 1, 0, 1]])

If r had lists with equal lengths, we could have used a fully vectorized approach.
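
For the equal-length case, one possible fully vectorized form uses broadcasting. This is only a sketch of what such an approach might look like, not necessarily what the answer had in mind. It assumes NumPy is installed, and r_equal is a made-up variant of r in which every sublist has exactly two tokens.

```python
import numpy as np

m = ['abc', 'bcd', 'cde', 'def']
# Hypothetical input: every sublist has the same length (2).
r_equal = [['abc', 'def'], ['bcd', 'cde'], ['cde', 'def']]

m_arr = np.array(m)        # shape (4,)
r_arr = np.array(r_equal)  # shape (3, 2)

# Compare every token of every row against every master token:
# (3, 1, 2) == (1, 4, 1) broadcasts to (3, 4, 2).
# any(axis=-1) then asks "does this row contain this master token?"
matches = (r_arr[:, None, :] == m_arr[None, :, None]).any(axis=-1)

out = matches.astype(int)
print(out.tolist())  # [[1, 0, 0, 1], [0, 1, 1, 0], [0, 0, 1, 1]]
```

The intermediate boolean array has len(r) * len(m) * row_length elements, so for very large inputs it may need to be processed in chunks.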




Answer 4:


Without numpy, you can do it with a nested list comprehension:

>>> m = ['abc','bcd','cde','def']
>>> r = [['abc','def'],['bcd','cde'],['abc','def','bcd']]

>>> [[int(mm in rr) for mm in m] for rr in r]
[[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]

Actually, you do not need the cast to int if boolean output is acceptable: bool is a subclass of int, so Python treats False as 0 and True as 1. Dropping the int(...) call also saves one function call per element. Hence, your expression will look like:

>>> [[mm in rr for mm in m] for rr in r]
[[True, False, False, True], [False, True, True, False], [True, True, False, True]]
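
To illustrate how the boolean result behaves like 0/1 values, here is a short sketch. The names flags, match_counts and as_ints are illustrative and not from the original answer.

```python
m = ['abc', 'bcd', 'cde', 'def']
r = [['abc', 'def'], ['bcd', 'cde'], ['abc', 'def', 'bcd']]

flags = [[mm in rr for mm in m] for rr in r]

# bool is a subclass of int, so True == 1 and False == 0.
print(isinstance(True, int))     # True
print(flags[0] == [1, 0, 0, 1])  # True

# Arithmetic works directly, e.g. counting matches per row.
match_counts = [sum(row) for row in flags]
print(match_counts)              # [2, 2, 3]

# Convert only when literal 0/1 values are required in the output.
as_ints = [[int(flag) for flag in row] for row in flags]
print(as_ints)                   # [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
```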



Answer 5:


Multiprocessing to the rescue!

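import multiprocessing as mp

# Master list; its order defines the columns of the output.
M = ['abc','bcd','cde','def']

def matcher(qIn, qOut):
    # Worker: read (index, sublist) pairs until the None sentinel arrives.
    for i,L in iter(qIn.get, None):
        s = set(L)  # a set makes each membership test O(1) on average
        answer = [1 if e in s else 0 for e in M]
        qOut.put((i,answer))


def main(L):
    qIn, qOut = [mp.Queue() for _ in range(2)]
    # Always start at least one worker, even on a single-core machine.
    procs = [mp.Process(target=matcher, args=(qIn, qOut)) for _ in range(max(1, mp.cpu_count()-1))]
    for p in procs: p.start()

    numElems = len(L)
    for t in enumerate(L): qIn.put(t)
    for p in procs: qIn.put(None)  # one sentinel per worker

    done = 0
    while done < numElems:
        i,answer = qOut.get()  # results arrive on qOut, in any order
        L[i] = answer
        done += 1

    for p in procs: p.join()  # workers exit on their own after the sentinel

if __name__ == "__main__":
    L = [['abc','def'],['bcd','cde'],['abc','def','bcd']]
    main(L)
    # now L looks like the required output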


Source: https://stackoverflow.com/questions/40665795/python-compare-tokenized-lists
