scikit-learn: clustering text documents using DBSCAN

Posted by 不想你离开。 on 2019-11-30 04:47:48

The implementation in sklearn seems to assume you are dealing with a finite vector space, and wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, all with the same dimensionality (one dimension per vocabulary term).

Your input data probably isn't a data matrix, but the sklearn implementation needs it to be one.

You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast, and should not have this limitation.

You'll need to spend some time understanding similarity first. For DBSCAN, you must choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain specific. Therefore, you first need to figure out which similarity threshold means that two documents are similar.
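One way to get a feel for a sensible epsilon is to look at how far each document is from its nearest neighbour. The sketch below is only an illustration under stated assumptions: it uses plain term counts and cosine distance from the standard library, and both the `cosine_distance` helper and the toy `docs` list are made up for this example, not taken from the question.

```python
import math
from collections import Counter


def cosine_distance(a: str, b: str) -> float:
    """Cosine distance between the raw term-count vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    # Treat an empty text as maximally distant from everything.
    return 1.0 - dot / norm if norm else 1.0


# Hypothetical toy corpus; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stock markets fell sharply today",
    "markets fell sharply today",
    "completely unrelated sentence about cooking",
]

# Distance from each document to its nearest neighbour.
nearest = []
for i, doc in enumerate(docs):
    dists = [cosine_distance(doc, other) for j, other in enumerate(docs) if j != i]
    nearest.append(min(dists))

# A jump in the sorted values hints at where "similar" ends and "noise" begins.
print(sorted(nearest))
```

Reading a few document pairs just below and just above a candidate threshold is usually more informative than the numbers alone, since only you can judge whether those pairs really are similar in your domain.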

Mean Shift may actually need your data to be in a vector space of fixed dimensionality.

cyniphile

It looks like sparse representations for DBSCAN are supported as of Jan. 2015.

I upgraded sklearn to 0.16.1 and it worked for me on text.
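For reference, a minimal sketch of what this can look like with a recent scikit-learn. The toy `docs` list and the `eps` and `min_samples` values are assumptions chosen for illustration only; as noted in the other answer, the right epsilon depends on your data. Cosine distance is also my choice here, not something the thread prescribes.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two obvious topics.
docs = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a cat sat on the mat",
    "stock markets fell sharply today",
    "stock markets fell sharply again today",
    "markets fell sharply today",
]

# fit_transform returns a scipy sparse matrix of shape (n_docs, n_terms).
X = TfidfVectorizer().fit_transform(docs)

# eps and min_samples are illustrative values, not recommendations.
db = DBSCAN(eps=0.5, min_samples=2, metric="cosine")
labels = db.fit_predict(X)  # the sparse matrix is passed in directly

print(X.shape)
print(labels)  # a label of -1 would mark a document as noise
```

The sparse matrix goes straight into `fit_predict` without being densified, which is what matters for larger text collections.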
