Unbalanced factor of KMeans?

时光总嘲笑我的痴心妄想 submitted on 2019-11-26 22:07:40

Question


Edit: The answer to this question is discussed at length in: Sum in Spark gone bad


In Compute Cost of Kmeans, we saw how to compute the cost of a KMeans model. I was wondering if we can also compute the Unbalanced factor?

If there is no such functionality provided by Spark, is there an easy way to implement it?


I was not able to find a reference for the Unbalanced factor, but it should be similar to Yael's unbalanced_factor (my comments):

// @hist: the number of points assigned to a cluster
// @n:    the number of clusters
double ivec_unbalanced_factor(const int *hist, long n) {
  int vw;
  double tot = 0, uf = 0;

  for (vw = 0 ; vw < n ; vw++) {
    tot += hist[vw];
    uf += hist[vw] * (double) hist[vw];
  }

  uf = uf * n / (tot * tot);

  return uf;

}

which I found here.

So the idea is that tot (for total) ends up equal to the number of points assigned to clusters (i.e. the size of our dataset), while uf (for unbalanced factor) accumulates the sum of the squares of the per-cluster point counts.

Finally, uf = uf * n / (tot * tot); normalizes that sum, so that a perfectly balanced clustering (every cluster holding tot / n points) yields a factor of exactly 1.
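To make the formula concrete, here is a minimal pure-Python translation of the C function above (function and variable names are my own):

```python
def unbalanced_factor(hist):
    """Compute the unbalanced factor from per-cluster point counts.

    hist -- list where hist[i] is the number of points in cluster i
    """
    n = len(hist)                           # number of clusters
    tot = sum(hist)                         # total number of points
    uf = sum(h * float(h) for h in hist)    # sum of squared cluster sizes
    return uf * n / (tot * tot)

# A perfectly balanced clustering yields 1.0:
print(unbalanced_factor([10, 10, 10]))  # 1.0
# The more skewed the assignment, the larger the factor:
print(unbalanced_factor([28, 1, 1]))    # 2.62
```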


Answer 1:


In Python it could be something like:

# Assumes an RDD of (cluster_id, feature_vector) tuples.
def unbalancedFactor(rdd):
    # number of points assigned to each cluster
    pdd = rdd.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)
    n = pdd.count()                                   # number of clusters
    total = pdd.map(lambda x: x[1]).sum()             # total number of points
    uf = pdd.map(lambda x: x[1] * float(x[1])).sum()  # sum of squared counts

    return uf * n / (total * total)
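For a quick sanity check without a Spark cluster, the same per-cluster counting and normalization can be imitated on a plain Python list of (cluster_id, feature_vector) tuples (the helper and sample data below are my own sketch, not part of the original answer):

```python
from collections import Counter

def unbalanced_factor_local(pairs):
    """Mimic the RDD pipeline on a plain list of
    (cluster_id, feature_vector) tuples."""
    counts = Counter(cluster for cluster, _ in pairs)   # points per cluster
    n = len(counts)                                     # number of clusters
    total = sum(counts.values())                        # total points
    uf = sum(c * float(c) for c in counts.values())     # sum of squared counts
    return uf * n / (total * total)

# Three points in cluster 0 and one in cluster 1: a skewed assignment.
data = [(0, [1.0]), (0, [1.1]), (0, [0.9]), (1, [5.0])]
print(unbalanced_factor_local(data))  # 1.25
```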


来源:https://stackoverflow.com/questions/39235576/unbalanced-factor-of-kmeans
