Apache Spark MLlib - Running KMeans with TF-IDF vectors - Java heap space


After some investigation, it turns out that this issue was related to the new HashingTF().transform(v) method. Although creating sparse vectors via the hashing trick is really helpful (especially when the number of features is not known in advance), the vectors must be kept sparse. The default size for HashingTF vectors is 2^20. With 64-bit double precision, each vector would theoretically require 8 MB when converted to a dense vector, regardless of any dimensionality reduction we could apply.
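One mitigation is to shrink the feature space when constructing HashingTF. Below is a minimal sketch of the TF-IDF pipeline with a smaller hash space, assuming an existing SparkContext sc; the corpus path is a placeholder:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Tokenized corpus; the path below is hypothetical.
    val documents: RDD[Seq[String]] =
      sc.textFile("hdfs:///path/to/corpus.txt").map(_.split(" ").toSeq)

    // 2^16 hash buckets instead of the default 2^20: a densified
    // vector then costs ~0.5 MB instead of ~8 MB.
    val hashingTF = new HashingTF(numFeatures = 1 << 16)
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache() // IDF makes two passes over the data

    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

Hash collisions become more likely as the hash space shrinks, so this trades a little accuracy for a 16x smaller dense footprint.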

Sadly, KMeans uses the toDense method (at least for the cluster centers), therefore causing an OutOfMemoryError: with k = 1000, the randomly sampled centers alone come to roughly 8 GB (1000 x 8 MB) of dense vectors per run.

  // From Spark's KMeans source: initRandom densifies every sampled center.
  private def initRandom(data: RDD[BreezeVectorWithNorm]) : Array[Array[BreezeVectorWithNorm]] = {
    val sample = data.takeSample(true, runs * k, new XORShiftRandom().nextInt()).toSeq
    Array.tabulate(runs)(r => sample.slice(r * k, (r + 1) * k).map { v =>
      new BreezeVectorWithNorm(v.vector.toDenseVector, v.norm)
    }.toArray)
  }
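With the smaller feature space, the densification in initRandom becomes affordable. A sketch of the downstream clustering step, continuing from the tfidf RDD above (the k and maxIterations values are illustrative):

    import org.apache.spark.mllib.clustering.KMeans

    // With 2^16 features, k = 1000 dense centers take about
    // 1000 * 65536 * 8 bytes, i.e. ~0.5 GB per run, whereas the
    // 2^20 default would have required ~8 GB.
    val model = KMeans.train(tfidf, 1000, 20)
    println(s"WSSSE: ${model.computeCost(tfidf)}")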