Sparse implementations of distance computations in Python / scikit-learn

Submitted by 孤者浪人 on 2020-01-21 07:19:27

Question


I have a large (100K by 30K) and (very) sparse dataset in svmlight format which I load as follows:

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_svmlight_file

X,Y = load_svmlight_file("somefile_svm.txt")

which returns X as a SciPy sparse matrix.

I simply need to compute the pairwise distances of all training points as

D = pdist(X)

Unfortunately, distance computation implementations in scipy.spatial.distance work only for dense matrices. Due to the size of the dataset it is infeasible to, say, use pdist as

D = pdist(X.todense())

Any pointers to sparse matrix distance computation implementations, or workarounds for this problem, would be greatly appreciated.

Many thanks


Answer 1:


In scikit-learn there is a sklearn.metrics.euclidean_distances function that works both for sparse matrices and dense numpy arrays. See the reference documentation.

However, non-Euclidean distances are not yet implemented for sparse matrices.
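
As a minimal sketch of the suggestion above: the snippet below passes a SciPy sparse matrix straight to sklearn.metrics.euclidean_distances without densifying the input. The small random matrix is only a stand-in for the svmlight data (an assumption for illustration). The result itself is a dense n-by-n array, so for something like 100K rows it would not fit in memory in one piece. The second half assumes a scikit-learn version that provides pairwise_distances_chunked and reduces each block of rows as it is produced; the row-sum reduction is purely illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import euclidean_distances, pairwise_distances_chunked

# Stand-in for the matrix returned by load_svmlight_file (assumption:
# any CSR matrix behaves the same way here).
X = sparse.random(50, 200, density=0.05, format="csr", random_state=0)

# The sparse input is accepted as-is; the output is a dense
# (n_samples, n_samples) array of pairwise Euclidean distances.
D = euclidean_distances(X)

# For large inputs the full distance matrix is too big to hold at once.
# pairwise_distances_chunked yields blocks of rows whose size is bounded
# by working_memory (in MiB), so each block can be reduced and discarded.
row_sums = []
for chunk in pairwise_distances_chunked(X, metric="euclidean", working_memory=1):
    # chunk has shape (rows_in_block, n_samples)
    row_sums.append(chunk.sum(axis=1))
row_sums = np.concatenate(row_sums)
```

Whether the chunked variant helps depends on what is done with the distances afterwards: it only pays off if each block can be reduced (for example to nearest neighbours or per-row statistics) rather than stored.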



Source: https://stackoverflow.com/questions/8956274/sparse-implementations-of-distance-computations-in-python-scikit-learn
