I have two big arrays to work on. But let's take a look at the following simplified example to get the idea:
I would like to find whether an element in data1 also appears in data2, and at which indices.
Note The other answers, using a dictionary (for checking exact matches) or a KDTree (for epsilon-close matches), are much better than this—both much faster and much more memory-efficient.
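For reference, the dictionary approach mentioned above can be sketched roughly as follows (the array shapes and seed are just assumptions for a reproducible example; rows are turned into hashable tuples so they can be dictionary keys):

```python
import numpy as np

np.random.seed(0)
a = np.random.randint(0, 100, (1000, 2))
b = np.random.randint(0, 100, (1000, 2))

# Map each row of b (as a hashable tuple) to the list of indexes where it occurs
lookup = {}
for j, row in enumerate(b):
    lookup.setdefault(tuple(row), []).append(j)

# For each row of a, collect the matching indexes in b
pairs = [(i, j) for i, row in enumerate(a) for j in lookup.get(tuple(row), [])]
```

This runs in roughly O(N + M + matches) time and memory, rather than the O(N * M) of the full distance matrix below.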
Use scipy.spatial.distance.cdist. If your two data arrays have N and M entries each, it will make an N-by-M pairwise distance array. If you can fit that in RAM, then it's easy to find the indexes that match:
import numpy as np
from scipy.spatial.distance import cdist
# Generate some data that's very likely to have repeats
a = np.random.randint(0, 100, (1000, 2))
b = np.random.randint(0, 100, (1000, 2))
# `cityblock` is likely the cheapest distance to calculate (no sqrt, etc.)
c = cdist(a, b, 'cityblock')
# And the indexes of all the matches:
aidx, bidx = np.nonzero(c == 0)
# sanity check:
print([(a[i], b[j]) for i,j in zip(aidx, bidx)])
The above prints out something like the following (the data is random, so your exact matches will differ):
[(array([ 0, 84]), array([ 0, 84])),
(array([50, 73]), array([50, 73])),
(array([53, 86]), array([53, 86])),
(array([96, 85]), array([96, 85])),
(array([95, 18]), array([95, 18])),
(array([ 4, 59]), array([ 4, 59])), ... ]
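And for the epsilon-close case mentioned at the top, a minimal KDTree sketch might look like this (again assuming the same random integer data; `r=0.5` on integer points amounts to an exact match, and raising it finds approximately-equal pairs instead):

```python
import numpy as np
from scipy.spatial import cKDTree

np.random.seed(0)
a = np.random.randint(0, 100, (1000, 2))
b = np.random.randint(0, 100, (1000, 2))

tree_a = cKDTree(a)
tree_b = cKDTree(b)

# For each point in a, the indexes of all points in b within Euclidean distance r
neighbors = tree_a.query_ball_tree(tree_b, r=0.5)
pairs = [(i, j) for i, js in enumerate(neighbors) for j in js]
```

Unlike cdist, this never materializes the full N-by-M distance matrix, so it scales to much larger arrays.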