Should one use distances (dissimilarities) or similarities in R for clustering?

Submitted by 爱⌒轻易说出口 on 2019-12-06 14:55:31

I'm not sure what you mean by "not as per expected". Whether I compute the dissimilarity matrix directly via proxy::dist(), or compute similarities via simil() and convert them to dissimilarities, I get the same matrix:

> dist(dfm, method='Pearson')
                                  Gawker Read/WriteWeb WWdN: In Exile ProBlogger Blog Tips Seth's Blog
Read/WriteWeb                  0.2662006                                                              
WWdN: In Exile                 0.2822594     0.2662006                                                
ProBlogger Blog Tips           0.2928932     0.5917517      0.6984887                                 
Seth's Blog                    0.2662006     0.2928932      0.4072510            0.2928932            
The Huffington Post | Raw Feed 0.1835034     0.2312939      0.2662006            0.2928932   0.2312939

> pr_simil2dist(simil(dfm, method = "pearson"))
                                  Gawker Read/WriteWeb WWdN: In Exile ProBlogger Blog Tips Seth's Blog
Read/WriteWeb                  0.2662006                                                              
WWdN: In Exile                 0.2822594     0.2662006                                                
ProBlogger Blog Tips           0.2928932     0.5917517      0.6984887                                 
Seth's Blog                    0.2662006     0.2928932      0.4072510            0.2928932            
The Huffington Post | Raw Feed 0.1835034     0.2312939      0.2662006            0.2928932   0.2312939

and clustering either matrix gives the same tree:

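library("proxy")  # supplies dist() with method = "Pearson", simil() and pr_simil2dist()

# dissimilarities computed directly, and via similarities converted to dissimilarities
d1 <- dist(dfm, method = "Pearson")
d2 <- pr_simil2dist(simil(dfm, method = "pearson"))
# cluster both matrices
h1 <- hclust(d1)
h2 <- hclust(d2)
# draw the two dendrograms side by side
layout(matrix(1:2, ncol = 2))
plot(h1)
plot(h2)
layout(1)
# compare the two fitted trees
all.equal(h1, h2)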

The last line yields:

> all.equal(h1, h2)
[1] "Component 6: target, current do not match when deparsed"

which tells us that h1 and h2 are exactly the same except for the stored function call, and that differs only because we passed d1 and d2 in the respective calls.
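To check this without the call getting in the way, you can compare only the components that describe the tree itself. What follows is a minimal sketch in base R, not the code used above: same_tree() is a hypothetical helper, the distances are plain Euclidean ones from stats::dist(), and the 1 / (1 + d) similarity is just an illustrative transform with an exact inverse. It is an assumption made for the example and is not what proxy's "Pearson" method computes.

```r
# Same six blogs and word counts as dfm, rebuilt as a plain matrix (base R only).
x <- rbind(
  "Gawker"                         = c(0, 1, 0, 0,  7, 0),
  "Read/WriteWeb"                  = c(2, 0, 1, 3,  1, 1),
  "WWdN: In Exile"                 = c(0, 2, 4, 0,  0, 0),
  "ProBlogger Blog Tips"           = c(0, 0, 0, 0,  2, 0),
  "Seth's Blog"                    = c(0, 0, 1, 0,  3, 1),
  "The Huffington Post | Raw Feed" = c(0, 6, 0, 0, 14, 5)
)
colnames(x) <- c("china", "kids", "music", "yahoo", "want", "wrong")

d1 <- stats::dist(x)   # Euclidean dissimilarities, computed directly
s  <- 1 / (1 + d1)     # illustrative similarity (assumption, not proxy's "Pearson")
d2 <- 1 / s - 1        # similarities converted back to dissimilarities

h1 <- hclust(d1)
h2 <- hclust(d2)

# Hypothetical helper: compare only the components that define the tree,
# leaving out the stored call.
same_tree <- function(a, b) {
  parts <- c("merge", "height", "order", "labels")
  isTRUE(all.equal(unclass(a)[parts], unclass(b)[parts]))
}

same_tree(h1, h2)
```

The point of the sketch is that hclust() only ever sees dissimilarities, so a similarity that is converted consistently leads to the same merges and heights as the dissimilarity it corresponds to.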

The figure produced (not reproduced here) shows the two dendrograms, h1 and h2, side by side.

If you set your object up correctly, you won't need to fiddle with the labels. Look at the row.names argument to read.table() to see how to specify that a column should be used as the row labels when the data are read in.
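As a rough sketch of the row.names idea, assuming the data arrive as comma-separated text whose first column holds the blog names (the column name "blog" and the inline text are made up for the example; the counts are a subset of dfm):

```r
# Hypothetical input: three of the blogs and three of the words from dfm.
lines <- c(
  "blog,china,kids,music",
  "Gawker,0,1,0",
  "Read/WriteWeb,2,0,1",
  "Seth's Blog,0,0,1"
)

# row.names = 1 uses the first column as row labels instead of data.
# quote = "\"" stops the apostrophe in "Seth's Blog" being read as a quote.
blogs <- read.table(text = lines, header = TRUE, sep = ",",
                    quote = "\"", row.names = 1)

# The row names are carried through dist() into the hclust labels,
# so the dendrogram is labelled without further work.
hc <- hclust(dist(blogs))
hc$labels
```

With a real file you would pass its path as the first argument instead of text = lines.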

All of this was done using:

dfm <- structure(list(china = c(0L, 2L, 0L, 0L, 0L, 0L), kids = c(1L, 
0L, 2L, 0L, 0L, 6L), music = c(0L, 1L, 4L, 0L, 1L, 0L), yahoo = c(0L, 
3L, 0L, 0L, 0L, 0L), want = c(7L, 1L, 0L, 2L, 3L, 14L), wrong = c(0L, 
1L, 0L, 0L, 1L, 5L)), .Names = c("china", "kids", "music", "yahoo", 
"want", "wrong"), class = "data.frame", row.names = c("Gawker", 
"Read/WriteWeb", "WWdN: In Exile", "ProBlogger Blog Tips", "Seth's Blog", 
"The Huffington Post | Raw Feed"))