Unbalanced factor of KMeans?

时光总嘲笑我的痴心妄想 submitted on 2019-11-26 22:07:40

Question


Edit: The answer to this question is discussed at length in: Sum in Spark gone bad


In Compute Cost of Kmeans, we saw how to compute the cost of a KMeans model. I was wondering if we can also compute the Unbalanced factor?

If there is no such functionality provided by Spark, is there an easy way to implement it?


I was not able to find a reference for the Unbalanced factor, but it should be similar to Yael's unbalanced_factor (my comments):

// @hist: the number of points assigned to a cluster
// @n:    the number of clusters
double ivec_unbalanced_factor(const int *hist, long n) {
  int vw;
  double tot = 0, uf = 0;

  for (vw = 0 ; vw < n ; vw++) {
    tot += hist[vw];
    uf += hist[vw] * (double) hist[vw];
  }

  uf = uf * n / (tot * tot);

  return uf;

}

which I found here.

So the idea is that tot (for total) ends up equal to the number of points assigned to clusters (i.e. the size of our dataset), while uf (for unbalanced factor) accumulates the sum of the squares of the per-cluster point counts.

Finally, uf = uf * n / (tot * tot); normalizes that sum, so that a perfectly balanced clustering (every cluster holding tot / n points) yields a factor of exactly 1.
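To make the formula concrete, here is a minimal pure-Python translation of the C function above (function and variable names are my own):

```python
def unbalanced_factor(hist):
    """Compute the unbalanced factor from per-cluster point counts.

    hist -- list where hist[i] is the number of points in cluster i
    """
    n = len(hist)                           # number of clusters
    tot = sum(hist)                         # total number of points
    uf = sum(h * float(h) for h in hist)    # sum of squared cluster sizes
    return uf * n / (tot * tot)

# A perfectly balanced clustering yields 1.0:
print(unbalanced_factor([10, 10, 10]))  # 1.0
# The more skewed the assignment, the larger the factor:
print(unbalanced_factor([28, 1, 1]))    # 2.62
```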


Answer 1:


In Python it could be something like:

# Assumes an RDD of (cluster_id, feature_vector) tuples.
def unbalancedFactor(rdd):
    # number of points assigned to each cluster
    pdd = rdd.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)
    n = pdd.count()                                   # number of clusters
    total = pdd.map(lambda x: x[1]).sum()             # total number of points
    uf = pdd.map(lambda x: x[1] * float(x[1])).sum()  # sum of squared counts

    return uf * n / (total * total)
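For a quick sanity check without a Spark cluster, the same per-cluster counting and normalization can be imitated on a plain Python list of (cluster_id, feature_vector) tuples (the helper and sample data below are my own sketch, not part of the original answer):

```python
from collections import Counter

def unbalanced_factor_local(pairs):
    """Mimic the RDD pipeline on a plain list of
    (cluster_id, feature_vector) tuples."""
    counts = Counter(cluster for cluster, _ in pairs)   # points per cluster
    n = len(counts)                                     # number of clusters
    total = sum(counts.values())                        # total points
    uf = sum(c * float(c) for c in counts.values())     # sum of squared counts
    return uf * n / (total * total)

# Three points in cluster 0 and one in cluster 1: a skewed assignment.
data = [(0, [1.0]), (0, [1.1]), (0, [0.9]), (1, [5.0])]
print(unbalanced_factor_local(data))  # 1.25
```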


来源:https://stackoverflow.com/questions/39235576/unbalanced-factor-of-kmeans
