scikit-learn: clustering text documents using DBSCAN

前端未结

关注

 2  1732

I\'m tryin to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have my problems with specific issues. Most of the examples I found illus

相关标签:

2条回答

栀梦

2020-12-24 13:52

The implementation in sklearn seems to assume you are dealing with a finite vector space, and wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, but now with the same dimensionality.

Your input data probably isn't a data matrix, but the sklearn implementations needs them to be one.

You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast, and should not have this limitation.

You'll need to spend some time in understanding similarity first. For DBSCAN, you must choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain specific. Therefore, you first need to figure out which similarity threshold means that two documents are similar.

Mean Shift may actually need your data to be vector space of fixed dimensionality.

0 讨论(0)
发布评论:

提交评论
- 加载中...
旧巷少年郎

2020-12-24 14:18

It looks like sparse representations for DBSCAN are supported as of Jan. 2015.

I upgraded sklearn to 0.16.1 and it worked for me on text.

0 讨论(0)
发布评论:

提交评论
- 加载中...