Spark Word2Vec example using text8 file


眼角桃花 2020-12-18 10:32

I'm trying to run this example from spark.apache.org (code is below & entire tutorial is here: https://spark.apache.org/docs/latest/mllib-feature-extraction.html) using

2 answers

清酒与你 (original poster)

2020-12-18 11:24

sc.textFile splits on newlines only, and text8 contains no newlines.

You are creating a 1-row RDD. .map(line => line.split(" ").toSeq) creates another 1-row RDD of type RDD[Seq[String]].
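To see why there is only one row, here is a rough plain-Scala sketch that needs no Spark. It assumes that splitting a string on "\n" is a fair stand-in for how sc.textFile produces records; SingleRowDemo and the sample sentence are made up for illustration.

```scala
object SingleRowDemo {
  // A tiny stand-in for text8: words separated by spaces, no newline anywhere.
  val text8Like: String = "anarchism originated as a term of abuse"

  // Approximation of sc.textFile (an assumption): one record per newline-delimited line.
  val rows: Seq[String] = text8Like.split("\n").toSeq

  // Same transformation as in the question: one Seq of tokens per row.
  val tokenized: Seq[Seq[String]] = rows.map(line => line.split(" ").toSeq)
}
```

Both rows and tokenized hold a single element, so all of the words end up in one giant "sentence".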

Word2Vec works best with one sentence per RDD row (which should also avoid Java heap errors). Unfortunately, text8 has had its periods stripped out, so you can't just split on them, but you can find the raw version here, along with the Perl script used to process it, and it isn't hard to edit the script so that it keeps periods.
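If you would rather keep using text8 as-is, one possible workaround (a swapped-in technique, not the sentence splitting described above) is to cut the word stream into fixed-size chunks and treat each chunk as a pseudo-sentence. The sketch below is plain Scala; Text8Chunker, chunkWords and wordsPerChunk are hypothetical names, and the right chunk size is something you would have to tune.

```scala
object Text8Chunker {
  // Split space-separated text into consecutive groups of at most
  // `wordsPerChunk` words. Empty tokens (e.g. from a leading space or
  // doubled spaces) are dropped. The last group may be shorter.
  def chunkWords(text: String, wordsPerChunk: Int): Seq[Seq[String]] = {
    require(wordsPerChunk > 0, "wordsPerChunk must be positive")
    text.split(" ").iterator
      .filter(_.nonEmpty)
      .grouped(wordsPerChunk)
      .map(_.toSeq)
      .toVector
  }
}
```

On the Spark side this could be plugged in as sc.textFile("text8").flatMap(line => Text8Chunker.chunkWords(line, 1000)), which gives an RDD[Seq[String]] with many rows that can be passed to Word2Vec's fit. The 1000 here is an arbitrary illustrative value, and chunk boundaries will not line up with real sentence boundaries.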
