Cutting dendrogram at highest level of purity

坚强是说给别人听的谎言, submitted on 2019-12-11 23:34:03

Question


I am trying to write a program that clusters documents using hierarchical agglomerative clustering. The output of the program depends on cutting the dendrogram at the level that gives maximum purity.

The following is the algorithm I am working on right now.

Create dendrogram for the documents in the dataset
purity = 0
final_clusters = none
for all the levels, lvl, in the dendrogram
    clusters = cut dendrogram at lvl
    new_purity = calculate_purity_of(clusters)
    if new_purity > purity
        purity = new_purity
        final_clusters = clusters

With this algorithm, I get the clusters from whichever level of the dendrogram has the highest calculated purity.

The problem is that when I cut the dendrogram at the lowest level, every cluster contains only one document, so each cluster is 100% pure and the average purity of the clusters is 1.0. But this is not the desired output. What I want is a proper grouping of the documents. Am I doing something wrong?


Answer 1:


The measure you are using is too simple.

Yes, the "optimal" solution with respect to purity is to only merge duplicate objects, so that each cluster remains pure by definition.
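
To make this concrete, here is a minimal sketch in Python. It is not the asker's code: the 1-D points, the class labels, and the helper names `purity` and `single_linkage_levels` are all invented for illustration, and a toy single-linkage merge stands in for whatever linkage the real program uses. Purity is taken here as the fraction of items that carry the majority label of their cluster.

```python
from collections import Counter


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    total = sum(len(c) for c in clusters)
    majority = sum(
        Counter(labels[i] for i in c).most_common(1)[0][1] for c in clusters
    )
    return majority / total


def single_linkage_levels(points):
    """Return the partition at every dendrogram level, from singletons
    (lowest level) up to a single cluster (root). Toy O(n^3) version."""
    clusters = [[i] for i in range(len(points))]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    abs(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        levels.append([list(c) for c in clusters])
    return levels


# Hypothetical data: six "documents" on a line, with known class labels.
points = [0.0, 0.4, 1.0, 5.0, 5.3, 9.0]
labels = ["a", "a", "b", "b", "b", "c"]

levels = single_linkage_levels(points)
scores = [(len(level), purity(level, labels)) for level in levels]
for n_clusters, score in scores:
    print(f"{n_clusters} clusters: purity = {score:.2f}")
```

On this toy data, purity is 1.0 at the singleton level and never increases as clusters are merged, so a search for the maximum always ends up at, or tied with, the lowest level. Any usable cut criterion has to trade purity off against something else, such as the number of clusters.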

This is why optimizing a mathematical criterion often isn't the right way to tackle a real data problem. Instead, you need to ask yourself: "What would be an interesting result?", where interesting is not the same as optimal in a mathematical sense.

Sorry that I cannot give you a better answer - but I don't have your data.

IMHO, any abstract mathematical approach will suffer the same fate. Let your data and your users' needs determine what to cluster, not some statistical number. Don't look to mathematics for the answer; look at your data and your users' needs.



Source: https://stackoverflow.com/questions/22317813/cutting-dendrogram-at-highest-level-of-purity
