Efficient way of constructing a matrix of pairwise distances between many vectors?

Posted by 限于喜欢 on 2021-02-11 08:10:37

Question


First, thanks for reading and taking the time to respond.

Second, the question:

I have a PxN matrix X, where P is on the order of 10^6 and N is on the order of 10^3. So X is relatively large and is not sparse. Say each row of X is an N-dimensional sample. I want to construct a PxP matrix of pairwise distances between these P samples, and I am interested in Hellinger distances.

So far I am relying on sparse dok matrices:

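import math
import numpy as np
import scipy.sparse

def hellinger_distance(X):
    P = X.shape[0]
    # Only the upper triangle (including the diagonal) is filled in the loop
    H1 = scipy.sparse.dok_matrix((P, P))
    for i in range(P):
        if i % 100 == 0:
            print(i)  # progress indicator
        x1 = X[i]
        X2 = X[i:P]
        # Hellinger distance from sample i to samples i..P-1
        h = np.sqrt(((np.sqrt(x1) - np.sqrt(X2))**2).sum(1)) / math.sqrt(2)
        H1[i, i:P] = h
    # The diagonal is zero, so adding the transpose mirrors the triangle
    H = H1 + H1.T
    return H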

This is super slow. Is there a more efficient way of doing this? Any help is much appreciated.


Answer 1:


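You can use pdist and squareform from scipy.spatial.distance. The Hellinger distance is the Euclidean distance between the element-wise square roots of two samples, divided by sqrt(2), so it is enough to take np.sqrt(X) once and compute ordinary Euclidean distances: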

import numpy as np
from scipy.spatial.distance import pdist, squareform

# pdist returns the condensed upper triangle; squareform expands it to PxP
out = squareform(pdist(np.sqrt(X)))/np.sqrt(2)

Or use cdist from the same module:

import numpy as np
from scipy.spatial.distance import cdist

sX = np.sqrt(X)  # take the square root once, not once per pair
out = cdist(sX,sX)/np.sqrt(2)
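
As a sanity check, the pdist version can be compared against a direct loop over the definition on a small input. This is only a sketch: the helper names hellinger_reference and hellinger_pdist, the 6x4 random matrix, and the row normalization (treating each sample as a probability vector) are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def hellinger_reference(X):
    """Direct double loop over the definition; for checking only."""
    P = X.shape[0]
    H = np.zeros((P, P))
    for i in range(P):
        for j in range(P):
            diff = np.sqrt(X[i]) - np.sqrt(X[j])
            H[i, j] = np.sqrt(np.sum(diff ** 2)) / np.sqrt(2)
    return H

def hellinger_pdist(X):
    """Vectorized version: Euclidean distance between square roots, over sqrt(2)."""
    return squareform(pdist(np.sqrt(X))) / np.sqrt(2)

# Small illustrative input (assumed): 6 samples, 4 dimensions,
# each row normalized to sum to 1 like a probability vector.
rng = np.random.default_rng(0)
X_small = rng.random((6, 4))
X_small /= X_small.sum(axis=1, keepdims=True)

H_ref = hellinger_reference(X_small)
H_fast = hellinger_pdist(X_small)
print(np.allclose(H_ref, H_fast))
```

Both functions compute the same quantity; the second one moves the pair loop into compiled code, which is where the speedup over the dok_matrix loop comes from.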



Answer 2:


In addition to Divakar's response, I realized that there is an implementation of this in sklearn which allows parallel processing:

import math
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

njobs = 3  # number of parallel jobs
H = pairwise_distances(np.sqrt(X), n_jobs=njobs, metric='euclidean') / math.sqrt(2)

I will do some benchmarking and post the results later.
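
One caveat on scale: with P on the order of 10^6, a dense PxP float64 matrix has 10^12 entries, which is about 8 TB, so the full result is unlikely to fit in memory whichever function computes it. A possible workaround, sketched here under assumptions, is to produce the distances one block of rows at a time and reduce or store each block before moving on. The generator name hellinger_blocks, the block_size parameter, and the nearest-neighbor reduction are hypothetical choices for illustration, not something proposed in the answers above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def hellinger_blocks(X, block_size=1000):
    """Yield (start, stop, block); block holds the Hellinger distances
    from rows start:stop to all rows of X."""
    sX = np.sqrt(X)  # square roots are computed once for the whole matrix
    P = sX.shape[0]
    for start in range(0, P, block_size):
        stop = min(start + block_size, P)
        yield start, stop, cdist(sX[start:stop], sX) / np.sqrt(2)

# Illustrative use (assumed): index of the nearest other sample for every
# row, without ever holding the full PxP matrix.
rng = np.random.default_rng(0)
X_demo = rng.random((10, 5))
nearest = np.empty(X_demo.shape[0], dtype=int)
n_blocks = 0
for start, stop, block in hellinger_blocks(X_demo, block_size=4):
    # Mask each row's distance to itself so it is not picked as nearest
    block[np.arange(stop - start), np.arange(start, stop)] = np.inf
    nearest[start:stop] = block.argmin(axis=1)
    n_blocks += 1
```

Only block_size x P distances are held at any moment, so block_size can be tuned to the available memory. Whether this helps depends on what the distance matrix is needed for afterwards.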



Source: https://stackoverflow.com/questions/32998842/efficient-way-of-constructing-a-matrix-of-pair-wise-distances-between-many-vecto
