I have two big arrays to work on. But let's take a look at the following simplified example to get the idea:
I would like to find whether an element in data1 also appears in data2, and at which indices.
Note The other answers, using a dictionary (for checking exact matches) or a KDTree (for epsilon-close matches), are much better than this—both much faster and much more memory-efficient.
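For reference, the dictionary approach mentioned above can be sketched roughly as follows (the array shapes and seed are just assumptions for a reproducible example; rows are turned into hashable tuples so they can be dictionary keys):

```python
import numpy as np

np.random.seed(0)
a = np.random.randint(0, 100, (1000, 2))
b = np.random.randint(0, 100, (1000, 2))

# Map each row of b (as a hashable tuple) to the list of indexes where it occurs
lookup = {}
for j, row in enumerate(b):
    lookup.setdefault(tuple(row), []).append(j)

# For each row of a, collect the matching indexes in b
pairs = [(i, j) for i, row in enumerate(a) for j in lookup.get(tuple(row), [])]
```

This runs in roughly O(N + M + matches) time and memory, rather than the O(N * M) of the full distance matrix below.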
Use scipy.spatial.distance.cdist. If your two data arrays have N and M entries each, it will make an N-by-M pairwise distance array. If you can fit that in RAM, then it's easy to find the indexes that match:
import numpy as np
from scipy.spatial.distance import cdist
# Generate some data that's very likely to have repeats
a = np.random.randint(0, 100, (1000, 2))
b = np.random.randint(0, 100, (1000, 2))
# `cityblock` is likely the cheapest distance to calculate (no sqrt, etc.)
c = cdist(a, b, 'cityblock')
# And the indexes of all the matches:
aidx, bidx = np.nonzero(c == 0)
# sanity check:
print([(a[i], b[j]) for i,j in zip(aidx, bidx)])
The above prints out something like the following (the data is random, so your exact matches will differ):
[(array([ 0, 84]), array([ 0, 84])),
(array([50, 73]), array([50, 73])),
(array([53, 86]), array([53, 86])),
(array([96, 85]), array([96, 85])),
(array([95, 18]), array([95, 18])),
(array([ 4, 59]), array([ 4, 59])), ... ]
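And for the epsilon-close case mentioned at the top, a minimal KDTree sketch might look like this (again assuming the same random integer data; `r=0.5` on integer points amounts to an exact match, and raising it finds approximately-equal pairs instead):

```python
import numpy as np
from scipy.spatial import cKDTree

np.random.seed(0)
a = np.random.randint(0, 100, (1000, 2))
b = np.random.randint(0, 100, (1000, 2))

tree_a = cKDTree(a)
tree_b = cKDTree(b)

# For each point in a, the indexes of all points in b within Euclidean distance r
neighbors = tree_a.query_ball_tree(tree_b, r=0.5)
pairs = [(i, j) for i, js in enumerate(neighbors) for j in js]
```

Unlike cdist, this never materializes the full N-by-M distance matrix, so it scales to much larger arrays.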