Spark Word2Vec example using text8 file

眼角桃花 2020-12-18 10:32

I'm trying to run this example from spark.apache.org (the code is below, and the entire tutorial is here: https://spark.apache.org/docs/latest/mllib-feature-extraction.html) using

2 Answers
  •  清酒与你
    2020-12-18 11:24

    sc.textFile splits on newlines only, and text8 contains no newlines.

    As a result, you are creating a 1-row RDD, and .map(line => line.split(" ").toSeq) only turns it into another 1-row RDD, of type RDD[Seq[String]].

    Word2Vec works best with one sentence per RDD row (which should also avoid the Java heap errors). Unfortunately, text8 has had its periods stripped out, so you can't just split on them. You can, however, find the raw version here, along with the Perl script used to process it, and it isn't hard to edit the script so that it keeps the periods.
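
    If re-processing the raw dump is not an option, one rough workaround is to cut the single long line into fixed-size chunks so that the RDD ends up with many rows. This is only a sketch, not what the Spark tutorial does: chunkTokens is a hypothetical helper, the chunk size is arbitrary, and fixed windows ignore real sentence boundaries. It is plain Scala, so it can be pasted into spark-shell or the Scala REPL.

```scala
// Hypothetical helper (not part of Spark): cut one long line of
// space-separated tokens into fixed-size pseudo-sentences.
def chunkTokens(line: String, chunkSize: Int): Iterator[Seq[String]] = {
  require(chunkSize > 0, "chunkSize must be positive")
  line
    .split(" ")
    .iterator
    .filter(_.nonEmpty)   // drop empty tokens left by repeated spaces
    .grouped(chunkSize)   // consecutive groups of at most chunkSize tokens
    .map(_.toSeq)
}

// Short stand-in string; the real input would be the single text8 line.
val sample = "anarchism originated as a term of abuse first used"
val rows = chunkTokens(sample, 4).toList

// One pseudo-sentence per printed line.
rows.foreach(r => println(r.mkString(" ")))
```

    In spark-shell, something like sc.textFile("text8").flatMap(line => chunkTokens(line, 1000)) should then give an RDD[Seq[String]] with many rows instead of one. The path and the chunk size of 1000 are assumptions to adjust.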
