Question
There seems to be a lot of information about creating either hierarchical or k-means clusters, but I would like to know if there is a solution in R that would create K clusters of approximately equal size. There is some material about doing this in other languages, but I have not been able to find anything online that suggests how to achieve the result in R.
An example would be
set.seed(123)
df <- matrix(rnorm(100*5), nrow=100)
km <- kmeans(df, 10)
print(sapply(1:10, function(n) sum(km$cluster==n)))
which results in
[1] 14 12 4 13 16 6 8 7 13 7
I would ideally like to see
[1] 10 10 10 10 10 10 10 10 10 10
Answer 1:
I would argue that you shouldn't, in the first place. Why? When there are naturally well-formed clusters in your data, e.g.,
plot(matrix(c(sample(1:10, 10), sample(30:40, 7), sample(80:90, 9)), ncol = 2, byrow = FALSE))
then these will be clustered together anyway (assuming k equals the natural number of clusters; see this comprehensive answer on how to choose a good k). If they are uniform in size, then you will have clusters of ~equal size; if they are not, then forcing a uniform cluster size will surely degrade the quality of the clustering solution. If you do not have naturally pretty clusters in your data, e.g.,
plot(matrix(sample(1:100, 100), ncol = 2))
then forcing a cluster size will either be redundant (if the data is completely random, the cluster sizes will be ~equal - but then there is not much point in clustering anyhow), or, if there are some nice clusters in there, e.g.,
plot(matrix(c(sample(1:15, 15), sample(20:100, 11)), ncol = 2, byrow = TRUE))
then the forced size will almost certainly break them.
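To see what forcing equal sizes costs on the question's data, here is a minimal sketch, not an established algorithm or a library function. It runs kmeans, then greedily reassigns points to the nearest k-means centre that still has room. The names cap, balanced and wss are hypothetical helpers introduced only for this illustration, and the greedy rule is just one simple way to enforce a size cap.

```r
set.seed(123)
df <- matrix(rnorm(100 * 5), nrow = 100)
k  <- 10
km <- kmeans(df, k)

# Squared Euclidean distance from every point (rows) to every centre (columns)
d <- sapply(seq_len(k), function(j) rowSums(sweep(df, 2, km$centers[j, ])^2))

# Greedy capacity-constrained assignment (illustrative assumption):
# visit point/centre pairs from closest to farthest and accept a pair
# only if the point is still unassigned and the cluster is not yet full.
n        <- nrow(df)
cap      <- n / k
balanced <- integer(n)
counts   <- integer(k)
for (idx in order(d)) {
  i <- (idx - 1) %% n + 1   # row (point) of this matrix entry
  j <- (idx - 1) %/% n + 1  # column (cluster) of this matrix entry
  if (balanced[i] == 0 && counts[j] < cap) {
    balanced[i] <- j
    counts[j]   <- counts[j] + 1
  }
}

# Total within-cluster sum of squares for a given labelling
wss <- function(x, cl) {
  sum(sapply(unique(cl), function(j) {
    g <- x[cl == j, , drop = FALSE]
    sum(sweep(g, 2, colMeans(g))^2)
  }))
}

print(table(balanced))         # every cluster holds exactly cap points
print(wss(df, km$cluster))     # unconstrained k-means
print(wss(df, balanced))       # size-constrained labelling
```

Comparing the two wss values shows how much compactness is given up for the equal sizes; on data without a natural equal-sized structure the constrained value would typically be the larger one.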
Ward's method, mentioned in the comments by JasonAizkalns, will, however, give you more "round" clusters than, for example, single linkage, so that might be a way to go (cf. help(hclust) for the difference between ward.D and ward.D2; it is not arbitrary).
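As a rough illustration of that difference, the sketch below cuts three hierarchical trees at 10 clusters on the question's data and compares the resulting cluster sizes. It uses only base R (dist, hclust, cutree); the choice of k = 10 and of Euclidean distance are assumptions carried over from the question, and nothing here guarantees equal sizes.

```r
set.seed(123)
df <- matrix(rnorm(100 * 5), nrow = 100)
d  <- dist(df)  # Euclidean distances between rows

# Cluster sizes (largest first) for each linkage method, tree cut at 10 clusters
sizes <- sapply(c("single", "ward.D", "ward.D2"), function(m) {
  cl <- cutree(hclust(d, method = m), k = 10)
  sort(as.vector(table(cl)), decreasing = TRUE)
})
print(sizes)
```

Per help(hclust), ward.D2 applies Ward's criterion to the distances as given (squaring them internally), whereas ward.D only does so if you pass squared distances yourself.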
Answer 2:
It's not totally clear what you're asking, but it is very easy to generate random data in R. If your data set has two dimensions, you could do something like this -
cluster1 = data.frame(x = rnorm(100, mean=5,sd=1), y = rnorm(100, mean=5,sd=1))
cluster2 = data.frame(x = rnorm(100, mean=15,sd=1), y = rnorm(100, mean=15,sd=1))
This generates normally distributed random data across x and y for 100 data points in each cluster.
Then view it -
plot(cluster1, xlim = c(0,25), ylim = c(0,25))
lines(cluster2, type = "p")
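To connect this back to cluster sizes, here is a small sketch that stacks the two generated groups and runs kmeans on them. The seed, the object names both and km2, and nstart = 10 are assumptions added for reproducibility; they are not part of the original answer.

```r
set.seed(123)
cluster1 <- data.frame(x = rnorm(100, mean = 5,  sd = 1), y = rnorm(100, mean = 5,  sd = 1))
cluster2 <- data.frame(x = rnorm(100, mean = 15, sd = 1), y = rnorm(100, mean = 15, sd = 1))

# Stack both groups and let k-means find two clusters; nstart repeats the
# random initialisation and keeps the best solution
both <- rbind(cluster1, cluster2)
km2  <- kmeans(both, centers = 2, nstart = 10)

print(table(km2$cluster))
plot(both, col = km2$cluster, xlim = c(0, 25), ylim = c(0, 25))
```

Because the two groups are well separated and were generated with the same number of points, the clusters should come out equally sized without any size constraint.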
Source: https://stackoverflow.com/questions/27804926/in-r-is-there-an-algorithm-to-create-approximately-equal-sized-clusters