Quick method to enumerate two big arrays?

死守一世寂寞 2021-01-16 07:56

I have two big arrays to work on. But let's take a look at the following simplified example to get the idea:

I would like to find if an element in data1

3 Answers
  •  深忆病人
    2021-01-16 08:51

    Note: The other answers, which use a dictionary (for checking exact matches) or a KDTree (for epsilon-close matches), are much better than this one: they are both much faster and much more memory-efficient.

    Use scipy.spatial.distance.cdist. If your two data arrays have N and M entries each, it will make an N by M pairwise distance array. If you can fit that in RAM, then it's easy to find the indexes that match:

    import numpy as np
    from scipy.spatial.distance import cdist
    
    # Generate some data that's very likely to have repeats    
    a = np.random.randint(0, 100, (1000, 2))
    b = np.random.randint(0, 100, (1000, 2))
    
    # `cityblock` is likely the cheapest distance to calculate (no sqrt, etc.)
    c = cdist(a, b, 'cityblock')
    
    # And the indexes of all the matches:
    aidx, bidx = np.nonzero(c == 0)
    
    # sanity check:
    print([(a[i], b[j]) for i,j in zip(aidx, bidx)])
    

    The above prints out:

    [(array([ 0, 84]), array([ 0, 84])),
     (array([50, 73]), array([50, 73])),
     (array([53, 86]), array([53, 86])),
     (array([96, 85]), array([96, 85])),
     (array([95, 18]), array([95, 18])),
     (array([ 4, 59]), array([ 4, 59])), ... ]
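
For comparison, the dictionary approach that the note at the top of this answer recommends for exact matches might look roughly like the sketch below. The function name `find_exact_matches` and the sample rows are made up for illustration and are not taken from the other answers. The idea is to index one array by row value once, so each row of the other array costs a single lookup and no N by M array is built.

```python
from collections import defaultdict


def find_exact_matches(data1, data2):
    """Return (i, j) index pairs where row data1[i] equals row data2[j]."""
    # Map each distinct row of data2 to every position where it occurs.
    positions = defaultdict(list)
    for j, row in enumerate(data2):
        positions[tuple(row)].append(j)

    # One dictionary lookup per row of data1 instead of a scan over data2.
    matches = []
    for i, row in enumerate(data1):
        for j in positions.get(tuple(row), ()):
            matches.append((i, j))
    return matches


# Small made-up sample (hypothetical data, not from the question).
a = [(0, 84), (50, 73), (7, 7), (0, 84)]
b = [(50, 73), (1, 2), (0, 84)]
print(find_exact_matches(a, b))  # prints [(0, 2), (1, 0), (3, 2)]
```

Because each row is converted to a tuple before hashing, integer NumPy arrays of shape (N, 2) such as `a` and `b` above should work the same way. For floating-point data, exact equality is rarely what you want, and a tolerance-based search is the safer choice.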
    

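For epsilon-close matches, the KDTree route mentioned in the note could be sketched as follows. This is only an illustration under stated assumptions: the helper name `find_close_matches`, the tolerance `eps`, and the sample points are all hypothetical, and `scipy.spatial.cKDTree` with its `query_ball_point` method is one way to do it, not necessarily what the other answers used.

```python
import numpy as np
from scipy.spatial import cKDTree


def find_close_matches(data1, data2, eps=1e-9):
    """Return (i, j) pairs where data1[i] lies within eps of data2[j] (Euclidean)."""
    # Build the tree once over data2.
    tree = cKDTree(data2)
    # For every row of data1, collect the indexes of data2 rows within radius eps.
    neighbours = tree.query_ball_point(data1, r=eps)
    return [(i, j) for i, js in enumerate(neighbours) for j in sorted(js)]


# Made-up sample: the first row of b differs from a[1] by a tiny rounding error.
a = np.array([[0.0, 84.0], [50.0, 73.0], [7.0, 7.0]])
b = np.array([[50.0, 73.0 + 1e-12], [1.0, 2.0], [0.0, 84.0]])
print(find_close_matches(a, b))
```

Unlike the `cdist` version, this never materialises the full N by M distance array, so memory use stays roughly proportional to the input size plus the number of matches found.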