Spark::KMeans calls takeSample() twice?

Submitted by Deadly on 2019-12-20 03:45:09

Question


I have a lot of data, and I have experimented with partition counts in the range [20k, 200k+].

I call it like this:

from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)

and I see that initRandom() calls takeSample() once.

The takeSample() implementation doesn't seem to call itself or do anything similar, so I would expect each KMeans() run to call takeSample() once. So why does the monitor show two takeSample()s per KMeans()?

Note: I ran more KMeans() calls, and each of them invokes two takeSample()s, regardless of whether the data is .cache()'d or not.

Moreover, the number of partitions doesn't affect the number of times takeSample() is called; it is always 2.

I am using Spark 1.6.2 (and I cannot upgrade) and my application is in Python, if that matters!


Update: I brought this to the Spark developers' mailing list.

Details of 1st takeSample():

Details of 2nd takeSample():

where one can see that the same code is executed.


Answer 1:


As suggested by Shivaram Venkataraman in Spark's mailing list:

I think takeSample itself runs multiple jobs if the amount of samples collected in the first pass is not enough. The comment and code path at GitHub should explain when this happens. Also you can confirm this by checking if the logWarning shows up in your logs.

// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
  logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
  samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  numIters += 1
}
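To make the retry logic in the Scala excerpt above easier to reason about, here is a minimal pure-Python sketch of the same idea: draw a Bernoulli sample, and repeat the whole pass if it came back too small. It does not use Spark at all. The function name take_sample_sketch and the oversample parameter are hypothetical, and the fraction computation is a simplified stand-in rather than the formula Spark actually uses. Each pass here would correspond to one sample(...).collect() job in Spark.

```python
import random


def take_sample_sketch(data, num, oversample=1.5, seed=0):
    """Return (sample, passes) for a retry-until-large-enough Bernoulli sample.

    Hypothetical illustration only: `oversample` is a simplified stand-in
    for Spark's fraction computation, not the real formula.
    """
    if num == 0:
        return [], 0
    if num > len(data):
        raise ValueError("num must not exceed len(data) without replacement")
    if oversample < 1.0:
        raise ValueError("oversample must be >= 1.0")

    rng = random.Random(seed)
    # Ask for more than num items on average, so one pass usually suffices.
    fraction = min(1.0, oversample * num / len(data))

    passes = 0
    while True:
        passes += 1
        # One full pass over the data; in Spark this would be one job.
        sample = [x for x in data if rng.random() < fraction]
        if len(sample) >= num:
            break

    # Trim the oversized sample down to exactly num items.
    rng.shuffle(sample)
    return sample[:num], passes
```

With oversample close to 1.0, a first pass falls short roughly half the time, so passes is often greater than 1. With a generous multiplier, a second pass becomes rare, which is what the comment in the Scala code is describing.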

However, as one can see, the second comment says this shouldn't happen often, yet in my case it happens every time, so if anyone has another idea, please let me know.

It was also suggested that this was a UI problem and that takeSample() was actually called only once, but that was just hot air.
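One way to act on the suggestion quoted above is to scan the driver log for the logWarning message. The sketch below is a small, Spark-independent helper for that. The function name find_resample_warnings is hypothetical, and it assumes only that the log contains the message text exactly as it appears in the Scala source. The surrounding log line format (timestamps, logger names) is not assumed.

```python
import re

# Message text taken from the logWarning call in takeSample().
RESAMPLE_PATTERN = re.compile(
    r"Needed to re-sample due to insufficient sample size\. Repeat #(\d+)"
)


def find_resample_warnings(log_lines):
    """Return the repeat numbers of all re-sample warnings in the given lines."""
    repeats = []
    for line in log_lines:
        match = RESAMPLE_PATTERN.search(line)
        if match:
            repeats.append(int(match.group(1)))
    return repeats


# Usage sketch; "driver.log" is a hypothetical path:
# with open("driver.log") as handle:
#     print(find_resample_warnings(handle))
```

If the returned list is empty for a run that still shows two takeSample() entries, the retry loop is probably not what produces the second entry.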



Source: https://stackoverflow.com/questions/38986395/sparkkmeans-calls-takesample-twice
