Spark::KMeans calls takeSample() twice?

Posted by 百般思念 on 2019-12-01 22:48:35

As suggested by Shivaram Venkataraman on the Spark mailing list:

I think takeSample itself runs multiple jobs if the amount of samples collected in the first pass is not enough. The comment and code path at GitHub should explain when this happens. Also you can confirm this by checking if the logWarning shows up in your logs.

// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
  logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
  samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  numIters += 1
}
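
To make the retry behaviour of that loop easier to see, here is a minimal local sketch that does not use Spark at all. It assumes a plain Seq in place of an RDD; `sampleOnce` and `takeSampleSketch` are hypothetical names introduced only for illustration, and the final shuffle-and-truncate step is an assumption of this sketch rather than something shown in the excerpt above.

```scala
import scala.util.Random

object ResampleSketch {
  // Local stand-in for RDD.sample(...).collect(): keeps each element
  // independently with probability `fraction` (hypothetical helper, not Spark code).
  def sampleOnce[T](data: Seq[T], fraction: Double, seed: Int): Seq[T] = {
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  // Returns a sample of exactly `num` elements together with the number of
  // extra passes that were needed because a pass came back too small.
  def takeSampleSketch[T](data: Seq[T], num: Int, fraction: Double, seed: Long): (Seq[T], Int) = {
    require(num <= data.length, "cannot draw more elements than exist without replacement")
    require(fraction > 0.0, "fraction must be positive or the loop cannot terminate")
    val rand = new Random(seed)
    var samples = sampleOnce(data, fraction, rand.nextInt())
    var numIters = 0
    // Same shape as the quoted loop: every iteration is one more full pass over the data.
    while (samples.length < num) {
      samples = sampleOnce(data, fraction, rand.nextInt())
      numIters += 1
    }
    // Assumption of this sketch: shuffle, then cut the oversized sample down to `num`.
    (rand.shuffle(samples).take(num), numIters)
  }
}
```

With `fraction = 1.0` every element is kept, so no extra pass is needed. The smaller the fraction is relative to `num / data.length`, the more likely it is that a pass comes back short and the loop runs again.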

However, the second comment line says this shouldn't happen often, yet for me it happens every time. If anyone has another idea, please let me know.
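
One way to check how often the re-sampling path is really taken, following the advice in the quoted reply, is to count the warning in the driver log. The sketch below is only an illustration: the object name, the helper name, and the idea of passing the log file path as the first argument are assumptions, and the marker text is copied from the logWarning call in the excerpt.

```scala
import scala.io.Source

object ResampleLogCheck {
  // Text taken from the logWarning call quoted above.
  val Marker = "Needed to re-sample due to insufficient sample size"

  // Counts the log lines that contain the re-sampling warning.
  def countResampleWarnings(lines: Iterator[String]): Int =
    lines.count(_.contains(Marker))

  def main(args: Array[String]): Unit = {
    // args(0) is assumed to be the path to a driver log file.
    val src = Source.fromFile(args(0))
    try println(s"re-sample warnings: ${countResampleWarnings(src.getLines())}")
    finally src.close()
  }
}
```

A count of zero would mean the extra job did not come from this loop.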

It was also suggested that this was a UI problem and that takeSample() was actually called only once, but that claim turned out to be unfounded.
