How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

礼貌的吻别 2021-01-18 08:05

I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g.

2 Answers
  •  天命终不由人
    2021-01-18 08:45

    Have you considered implementing the search yourself?

    It's not particularly hard to implement a for loop. Even if you want to optimize two parameters it's still fairly easy.
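As a rough sketch of what such a loop could look like for DBSCAN: the function name `dbscan_grid_search`, the parameter grids, the toy data from `make_blobs`, and the choice of the silhouette score as the selection criterion are all assumptions made for illustration, not something from the question. Note that scoring only the non-noise points favors settings that discard many points as noise, so treat the result as a starting point rather than an answer.

```python
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def dbscan_grid_search(X, eps_values, min_samples_values, metric="euclidean"):
    """Hypothetical helper: try every (eps, min_samples) pair and
    return (best_params, best_score) by silhouette score."""
    best_params, best_score = None, -1.0
    for eps, min_samples in product(eps_values, min_samples_values):
        labels = DBSCAN(eps=eps, min_samples=min_samples, metric=metric).fit_predict(X)
        mask = labels != -1  # DBSCAN marks noise points with label -1
        n_clusters = len(set(labels[mask]))
        # The silhouette score needs at least 2 clusters and more points than clusters.
        if n_clusters < 2 or mask.sum() <= n_clusters:
            continue
        score = silhouette_score(X[mask], labels[mask], metric=metric)
        if score > best_score:
            best_params = {"eps": eps, "min_samples": min_samples}
            best_score = score
    return best_params, best_score


# Toy 2-D data standing in for real document vectors (an assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
best_params, best_score = dbscan_grid_search(X, [0.2, 0.3, 0.5, 0.8], [3, 5, 10])
print(best_params, best_score)
```

For MeanShift the same loop shape should work with a single `bandwidth` grid in place of the two DBSCAN parameters.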

    For both DBSCAN and MeanShift, however, I advise first understanding your similarity measure. It makes more sense to choose the parameters based on an understanding of your measure than to optimize them to match some labels (which carries a high risk of overfitting).

    In other words, at which distance are two articles supposed to be clustered?

    If this distance varies too much from one data point to another, these algorithms will fail badly, and you may need to find a normalized distance function so that the actual similarity values are meaningful again. TF-IDF is standard on text, but mostly in a retrieval context; it may work much worse in a clustering context.
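One way to get a feeling for the distances before picking a DBSCAN `eps` is to look at how far each document is from its k-th nearest neighbor. The sketch below assumes TF-IDF vectors with cosine distance (which normalizes for document length); the toy `docs` list and `k = 2` are placeholders, not values from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Placeholder corpus; substitute the real documents here.
docs = [
    "the cat sat on the mat",
    "a cat lay on the mat",
    "dogs chase cats in the garden",
    "stock markets fell sharply today",
    "markets rallied as stocks rose",
    "investors sold stocks and bonds",
]

tfidf = TfidfVectorizer().fit_transform(docs)

k = 2  # which neighbor to look at; roughly plays the role of min_samples
nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(tfidf)

# Called without arguments, kneighbors() queries the training points
# and does not count a point as its own neighbor.
distances, _ = nn.kneighbors()

# Sorted distance to the k-th neighbor: a wide spread suggests that
# no single eps will suit all documents.
kth = np.sort(distances[:, -1])
print(np.round(kth, 2))
```

If the sorted values rise smoothly with no clear bend, that is a hint that a single global threshold may not describe the data well.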

    Also beware that MeanShift (similar to k-means) needs to recompute coordinates. On text data this may yield undesired results, where the updated coordinates actually get worse instead of better.
