DBSCAN error with cosine metric in python

Anonymous (unverified), submitted on 2019-12-03 02:30:02

Question:

I was trying to use the DBSCAN algorithm from the scikit-learn library with the cosine metric, but got stuck on an error. The line of code is

db = DBSCAN(eps=1, min_samples=2, metric='cosine').fit(X)     

where X is a csr_matrix. The error is the following:

Metric 'cosine' not valid for algorithm 'auto',

though the documentation says that this metric can be used. I tried the options algorithm='kd_tree' and algorithm='ball_tree' but got the same error. However, there is no error if I use the euclidean or, say, the l1 metric.

The matrix X is large, so I can't use a precomputed matrix of pairwise distances.

I use python 2.7.6 and scikit-learn 0.16.1. My dataset doesn't have a full row of zeros, so cosine metric is well-defined.

Answer 1:

The tree indexes in sklearn (ball tree and k-d tree) cannot accelerate the cosine metric. This may change in newer versions.

Try algorithm='brute'.
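As a minimal sketch of this suggestion, the snippet below runs DBSCAN with the cosine metric and algorithm='brute' on a sparse matrix. The toy matrix and the eps value are assumptions chosen for illustration, not values from the question, and the call is written against a recent scikit-learn rather than 0.16.1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Toy data (assumed for illustration): two rows along each axis of a 2-D space.
X = csr_matrix(np.array([[1.0, 0.0],
                         [2.0, 0.0],
                         [0.0, 1.0],
                         [0.0, 3.0]]))

# algorithm='brute' computes distances directly instead of building a tree index,
# so a metric that the trees do not support can still be used.
db = DBSCAN(eps=0.5, min_samples=2, metric='cosine', algorithm='brute').fit(X)
labels = db.labels_
print(labels)
```

Rows pointing in the same direction have cosine distance 0, so they should end up in the same cluster regardless of their length.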

For a list of metrics that your version of sklearn can accelerate, see the supported metrics of the ball tree:

from sklearn.neighbors import BallTree
print(BallTree.valid_metrics)


Answer 2:

If you want a normalized distance like the cosine distance, you can also normalize your vectors first and then use the euclidean metric. For two normalized vectors u and v, the Euclidean distance equals sqrt(2 - 2*cos(u, v)), because ||u - v||^2 = ||u||^2 + ||v||^2 - 2*(u . v) = 2 - 2*cos(u, v) (see this discussion).

You can hence do something like:

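import numpy as np
from sklearn.cluster import DBSCAN

# Assumes X is a dense NumPy array; a sparse matrix needs a sparse-aware normalization.
Xnorm = np.linalg.norm(X, axis=1)
Xnormed = np.divide(X, Xnorm.reshape(Xnorm.shape[0], 1))
db = DBSCAN(eps=0.5, min_samples=2, metric='euclidean').fit(Xnormed)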

The Euclidean distances between normalized vectors lie in [0, 2], so make sure you adjust eps accordingly.
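Since X in the question is a csr_matrix, a variant that keeps the data sparse may be more practical. The sketch below uses sklearn.preprocessing.normalize for the row normalization and converts a cosine-distance threshold into a Euclidean one: with cosine distance d = 1 - cos(u, v), the identity above gives a Euclidean distance of sqrt(2*d) for unit vectors. The toy matrix and the threshold 0.5 are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

# Toy data (assumed for illustration): two rows along each axis of a 2-D space.
X = csr_matrix(np.array([[1.0, 0.0],
                         [2.0, 0.0],
                         [0.0, 1.0],
                         [0.0, 3.0]]))

# Scale every row to unit L2 norm; the result stays a sparse matrix.
X_unit = normalize(X, norm='l2', axis=1)

# Convert a cosine-distance threshold into the matching Euclidean threshold.
eps_cosine = 0.5  # hypothetical threshold
eps_euclidean = np.sqrt(2.0 * eps_cosine)

db = DBSCAN(eps=eps_euclidean, min_samples=2, metric='euclidean').fit(X_unit)
labels = db.labels_
print(labels)
```

Rows that are all zeros cannot be scaled to unit length, so this approach relies on the same condition stated in the question: no all-zero rows.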


