In R, is there an algorithm to create approximately equal sized clusters

心不动则不痛 posted on 2019-12-05 11:01:38
user3554004

I would argue that you shouldn't, in the first place. Why? When there are naturally well-formed clusters in your data, e.g.,

plot(matrix(c(sample(1:10, 10), sample(30:40, 7), sample(80:90, 9)), ncol = 2, byrow = FALSE))

then these will be clustered together anyway (assuming k equals the natural number of clusters; see this comprehensive answer on how to choose a good k). If the natural clusters are uniform in size, you will get clusters of roughly equal size; if they are not, forcing a uniform cluster size will surely degrade the quality of the clustering solution. If you do not have naturally pretty clusters in your data, e.g.,

plot(matrix(sample(1:100, 100), ncol = 2))

then forcing a cluster size will either be redundant (if the data is completely random, the cluster sizes will be ~equal - but then there is not much point in clustering anyhow), or, if there are some nice clusters in there, e.g.,

plot(matrix(c(sample(1:15, 15), sample(20:100, 11)), ncol = 2, byrow = TRUE))

then the forced size will almost certainly break them.
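
To make the argument concrete, here is a minimal sketch on simulated data (nothing in it comes from the question). It assumes three well-separated groups of unequal size; the sizes, the means and the rank-based split are all illustrative choices. Plain k-means is free to return the natural group sizes, while a crude equal-size rule (cutting the points into thirds along x, used here as a stand-in for any size-forcing scheme) has to split a natural group.

```r
set.seed(42)

# Three well-separated groups of unequal size (illustrative values)
sizes <- c(50, 30, 20)
truth <- rep(1:3, times = sizes)
centers <- c(0, 10, 20)
pts <- cbind(x = rnorm(100, mean = centers[truth], sd = 1),
             y = rnorm(100, mean = centers[truth], sd = 1))

# Plain k-means with k set to the natural number of groups
km <- kmeans(pts, centers = 3, nstart = 25)
natural_sizes <- sort(km$size)

# A crude size-forcing rule: cut the points into three near-equal parts
# according to their rank along x
forced <- cut(rank(pts[, "x"], ties.method = "first"),
              breaks = c(0, 34, 67, 100), labels = FALSE)
forced_sizes <- as.vector(table(forced))

# Cross-tabulate true group against forced cluster; counts off the diagonal
# are points pulled out of their natural group
split_table <- table(truth, forced)

print(natural_sizes)
print(forced_sizes)
print(split_table)
```

The first group has 50 points and cannot fit into a forced cluster of 34, so the remainder ends up sharing a cluster with points from the second group, which is the "breaking" described above.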

Ward's method, mentioned in the comments by JasonAizkalns, will however give you more "round" clusters than, for example, single linkage, so that might be a way to go (cf. help(hclust) for the difference between "ward.D" and "ward.D2"; it's not arbitrary).
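
As a rough illustration of that difference, the sketch below clusters the same points with single linkage and with Ward's method and compares the resulting cluster sizes. The data are simulated uniform noise and k = 4 is an arbitrary choice; both are assumptions made only for the example. Per help(hclust), "ward.D2" is the variant that squares the dissimilarities before cluster updating, so it is the one to use with plain Euclidean distances.

```r
set.seed(1)

# Simulated, structureless 2-D points (illustrative only)
pts <- cbind(x = runif(200), y = runif(200))
d <- dist(pts)  # Euclidean distances

# Same distances, two different linkage rules
hc_single <- hclust(d, method = "single")
hc_ward <- hclust(d, method = "ward.D2")

# Cut both trees into the same number of clusters and compare sizes
k <- 4
sizes_single <- as.vector(table(cutree(hc_single, k = k)))
sizes_ward <- as.vector(table(cutree(hc_ward, k = k)))

print(sizes_single)
print(sizes_ward)
```

Single linkage tends to chain points together, so on data like this it often yields one large cluster plus a few tiny ones, whereas Ward's cluster sizes are usually more even. Neither method guarantees equal sizes.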

It's not totally clear what you're asking, but it is very easy to generate random data in R. If your data set has two dimensions, you could do something like this:

cluster1 = data.frame(x = rnorm(100, mean=5,sd=1), y  = rnorm(100, mean=5,sd=1))
cluster2 = data.frame(x = rnorm(100, mean=15,sd=1), y  = rnorm(100, mean=15,sd=1))

This generates normally distributed random data across x and y for 100 data points in each cluster.

Then view it -

plot(cluster1, xlim = c(0,25), ylim = c(0,25))
lines(cluster2, type = "p")
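
If it helps to check the simulated data against a clustering algorithm, one option is to stack the two data frames and run k-means on them. This is a sketch only: the label column, the seed and the choice of kmeans with nstart = 10 are assumptions added for the example, not part of the answer above.

```r
set.seed(123)

# Same two simulated clusters as above
cluster1 <- data.frame(x = rnorm(100, mean = 5, sd = 1), y = rnorm(100, mean = 5, sd = 1))
cluster2 <- data.frame(x = rnorm(100, mean = 15, sd = 1), y = rnorm(100, mean = 15, sd = 1))

# Stack them, keeping track of which cluster each row was drawn from
both <- rbind(cbind(cluster1, label = "cluster1"),
              cbind(cluster2, label = "cluster2"))

# Cluster on the coordinates only, then compare with the known labels
km <- kmeans(both[, c("x", "y")], centers = 2, nstart = 10)
agreement <- table(both$label, km$cluster)
print(agreement)
```

Because the two groups are generated with the same size and are far apart, the recovered clusters come out equal in size without any size constraint.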