scikit-learn: clustering text documents using DBSCAN

Happy的楠姐 2020-12-24 13:23

I'm trying to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have problems with specific issues. Most of the examples I found illus

2 answers
  • 2020-12-24 13:52

    The implementation in sklearn seems to assume you are dealing with a finite vector space, and wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, but now with the same dimensionality.

    Your input data probably isn't a data matrix, but the sklearn implementation needs it to be one.

    You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast, and should not have this limitation.

    You'll need to spend some time understanding similarity first. For DBSCAN, you must choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain-specific. Therefore, you first need to figure out which similarity threshold means that two documents are similar.
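
    One way to get a feel for a sensible threshold is to look at the actual pairwise distances between a few documents you know to be similar or unrelated. The sketch below is only an illustration: the toy documents, the TF-IDF representation, and the cosine distance are all assumptions, not something the question prescribes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical toy corpus; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
    "stock markets rose sharply today",
    "dbscan groups points by density",
]

# Sparse TF-IDF matrix, one row per document.
X = TfidfVectorizer().fit_transform(docs)

# Dense n x n matrix of cosine distances (0 = same direction, 1 = no shared terms).
D = cosine_distances(X)

# Ignore self-distances, then find each document's distance to its nearest neighbour.
np.fill_diagonal(D, np.inf)
nearest = D.min(axis=1)

# Inspecting the sorted values shows where "similar" pairs end and unrelated ones begin.
print(np.sort(nearest))
```

    Documents that share most of their terms end up well below 1.0, while a document with no terms in common with the rest sits at 1.0. An epsilon between those two groups is a starting point, not a rule.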

    Mean Shift may actually need your data to be a vector space of fixed dimensionality.

  • 2020-12-24 14:18

    It looks like sparse representations for DBSCAN are supported as of Jan. 2015.

    I upgraded sklearn to 0.16.1 and it worked for me on text.
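
    A minimal sketch of what this can look like, assuming a reasonably recent scikit-learn. The toy documents, `metric="cosine"`, `eps=0.5` and `min_samples=2` are illustrative assumptions and will need tuning for real data.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
    "stock markets rose sharply today",
    "dbscan groups points by density",
]

# fit_transform returns a scipy sparse matrix; it is passed to DBSCAN as-is.
X = TfidfVectorizer().fit_transform(docs)

# eps is a cosine-distance threshold here; min_samples counts the point itself.
model = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit(X)

# One label per document; -1 marks documents DBSCAN treats as noise.
labels = model.labels_
for label, doc in zip(labels, docs):
    print(label, doc)
```

    The matrix is never converted to a dense array, so the same pattern should carry over to larger vocabularies.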
