Question:
I'm given a 2-D numpy array X of floating-point values and need to compute the Euclidean distances between all pairs of rows, then find the k smallest distances and return the corresponding row indices (where k > 0). I'm testing with a small array and this is what I have so far...
    import numpy as np
    from sklearn.metrics.pairwise import euclidean_distances

    X_testing = np.asarray([[1,2,3.5],[4,1,2],[0,0,2],[3.4,1,5.6]])
    test = euclidean_distances(X_testing, X_testing)
    print(test)
The resulting printout is:
    [[ 0.          3.5         2.6925824   3.34215499]
     [ 3.5         0.          4.12310563  3.64965752]
     [ 2.6925824   4.12310563  0.          5.05173238]
     [ 3.34215499  3.64965752  5.05173238  0.        ]]
Next, I need to efficiently compute the top k smallest distances between all pairs of rows, and return the corresponding k tuples of (row1, row2, distance_value) in order in the form of a list.
So in the above test case, if k = 2, then I would need to return the following:
    [(0, 2, 2.6925824), (0, 3, 3.34215499)]
Is there a built-in way (in scipy, sklearn, numpy, etc.), or any other way, to compute this efficiently? Although the above test case is small, in reality the 2-D array is very large, so memory and time are both concerns. Thanks
Answer 1:
Using scipy.spatial instead of sklearn (which I haven't installed yet) I can get the same distance matrix:
    In [623]: from scipy import spatial
    In [624]: pdist=spatial.distance.pdist(X_testing)
    In [625]: pdist
    Out[625]:
    array([ 3.5       ,  2.6925824 ,  3.34215499,  4.12310563,  3.64965752,
            5.05173238])
    In [626]: D=spatial.distance.squareform(pdist)
    In [627]: D
    Out[627]:
    array([[ 0.        ,  3.5       ,  2.6925824 ,  3.34215499],
           [ 3.5       ,  0.        ,  4.12310563,  3.64965752],
           [ 2.6925824 ,  4.12310563,  0.        ,  5.05173238],
           [ 3.34215499,  3.64965752,  5.05173238,  0.        ]])
pdist is in condensed form, whose indices in the squareform can be found with:
    In [629]: np.triu_indices(4,1)
    Out[629]:
    (array([0, 0, 0, 1, 1, 2], dtype=int32),
     array([1, 2, 3, 2, 3, 3], dtype=int32))
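As a quick sanity check of that correspondence, here is a minimal sketch (the names `condensed`, `I`, `J` and `same_order` are just illustrative) confirming that entry `m` of the condensed vector is the distance between rows `I[m]` and `J[m]`:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.asarray([[1, 2, 3.5], [4, 1, 2], [0, 0, 2], [3.4, 1, 5.6]])

condensed = pdist(X)                 # n*(n-1)/2 distances, upper triangle row by row
D = squareform(condensed)            # full symmetric n x n matrix
I, J = np.triu_indices(len(X), k=1)  # (row, col) of each upper-triangle entry

# Reading the upper triangle of D at (I, J) should reproduce the condensed vector.
same_order = np.allclose(D[I, J], condensed)
print(same_order)
```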
The 2 smallest distances are located at the first 2 indices returned by argsort:
    In [630]: idx=np.argsort(pdist)
    In [631]: idx
    Out[631]: array([1, 2, 0, 4, 3, 5], dtype=int32)
So we want [1,2] from pdist and the corresponding elements of the triu:
    In [633]: pdist[idx[:2]]
    Out[633]: array([ 2.6925824 ,  3.34215499])
    In [634]: np.transpose(np.triu_indices(4,1))[idx[:2],:]
    Out[634]:
    array([[0, 2],
           [0, 3]], dtype=int32)
and to collect those values as a list of tuples:
    In [636]: I,J = np.triu_indices(4,1)
    In [637]: kbig = idx[:2]
    In [638]: [(i,j,d) for i,j,d in zip(I[kbig], J[kbig], pdist[kbig])]
    Out[638]: [(0, 2, 2.6925824035672519), (0, 3, 3.3421549934136805)]
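Since the question mentions large arrays, one possible variation is to wrap the steps in a function and swap the full argsort for np.argpartition, which only has to separate out the k smallest entries before they are sorted. This is a sketch, not a benchmarked solution: the function name `k_smallest_pairs` is made up for illustration, and it still needs the whole condensed distance vector plus the two index arrays from np.triu_indices in memory.

```python
import numpy as np
from scipy.spatial.distance import pdist

def k_smallest_pairs(X, k):
    """Return the k closest row pairs as (row1, row2, distance) tuples, nearest first."""
    n = len(X)
    condensed = pdist(X)               # n*(n-1)/2 distances, no full square matrix
    k = min(k, condensed.size)         # cannot return more pairs than exist
    # Partial selection: the first k positions hold the k smallest, in arbitrary order.
    part = np.argpartition(condensed, k - 1)[:k]
    # Sort only those k candidates by distance.
    order = part[np.argsort(condensed[part])]
    I, J = np.triu_indices(n, k=1)     # map condensed positions back to (row, col)
    return [(int(i), int(j), float(d))
            for i, j, d in zip(I[order], J[order], condensed[order])]

X_testing = np.asarray([[1, 2, 3.5], [4, 1, 2], [0, 0, 2], [3.4, 1, 5.6]])
result = k_smallest_pairs(X_testing, 2)
print(result)
```

For the test array this should give the same two pairs as above, (0, 2) and (0, 3). If even the condensed vector is too big to hold, the distances would have to be computed in chunks, which this sketch does not attempt.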
Related question: Numpy array of distances to list of (row,col,distance)
Answer 2:
This is by example, but incorporates a list comprehension so you can see the slicing. Obviously not a speed demon, but more for understanding.
    >>> import numpy as np
    >>> a = np.random.randint(0,10, size=(5,5))
    >>> a
    array([[8, 3, 3, 8, 9],
           [0, 8, 6, 6, 5],
           [6, 7, 6, 5, 0],
           [4, 2, 4, 0, 3],
           [4, 1, 3, 2, 2]])
    >>> idx = np.argsort(a, axis=1)
    >>> idx
    array([[1, 2, 0, 3, 4],
           [0, 4, 2, 3, 1],
           [4, 3, 0, 2, 1],
           [3, 1, 4, 0, 2],
           [1, 3, 4, 2, 0]])
    >>> v = np.vstack([ a[i][idx[i]] for i in range(len(idx))])
    >>> v
    array([[3, 3, 8, 8, 9],
           [0, 5, 6, 6, 8],
           [0, 5, 6, 6, 7],
           [0, 2, 3, 4, 4],
           [1, 2, 2, 3, 4]])
    >>> v3 = np.vstack([ a[i][idx[i]][:3] for i in range(len(idx))])
    >>> v3
    array([[3, 3, 8],
           [0, 5, 6],
           [0, 5, 6],
           [0, 2, 3],
           [1, 2, 2]])
You can experiment with the slicing and rewrite it in pure numpy (without the list comprehension) if you like.
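For what it's worth, one way the loop-free version might look is sketched below, using the same example array hard-coded so the result is reproducible. It relies on np.take_along_axis (available in NumPy 1.15 and later) to pick the values at the sorted column positions; the variable `k` is introduced here for illustration.

```python
import numpy as np

# Same values as the random array shown above, fixed for reproducibility.
a = np.array([[8, 3, 3, 8, 9],
              [0, 8, 6, 6, 5],
              [6, 7, 6, 5, 0],
              [4, 2, 4, 0, 3],
              [4, 1, 3, 2, 2]])
k = 3

# Column indices of the k smallest values in each row, smallest first.
idx = np.argsort(a, axis=1)[:, :k]
# Gather the values at those positions, row by row, without a Python loop.
v3 = np.take_along_axis(a, idx, axis=1)
print(v3)
```

Note that this, like the list comprehension version, gives the k smallest values per row rather than the k smallest pairs overall.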