Understanding Spark MLlib LDA input format

会有一股神秘感 · Submitted on 2019-12-06 13:21:32

Question


I am trying to implement LDA using Spark MLlib.

But I am having difficulty understanding the input format. I was able to run the sample implementation, which takes input from a file containing only numbers, as shown below:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

I followed http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda

I understand the output format of this as explained here.

My use case is very simple: I have one data file with some sentences, and I want to convert this file into a corpus so that I can pass it to org.apache.spark.mllib.clustering.LDA.run().

My doubt is about what the numbers in the input represent before they are zipWithIndex-ed and passed to LDA. Does the number 1 represent the same word everywhere it appears, or is it some kind of count?
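The sample matrix above can be read as follows (an assumption based on the linked Spark docs, where the sample file is a document-term count matrix): each row is one document, each column is one vocabulary term, and the value is how many times that term occurs in that document. A minimal sketch:

```scala
// Assumption: each row of the sample file is a word-count vector for one document.
// A "1" in two different columns refers to two *different* words; the numbers
// are counts, not word ids.
val line = "1 2 6 0 2 3 1 1 0 0 3"            // first row of the sample input
val counts: Array[Int] = line.trim.split(' ').map(_.toInt)
// counts(j) = number of occurrences of vocabulary term j in this document
```

So the vocabulary here has 11 terms, and the same digit appearing in different positions refers to different words.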


Answer 1:


First you need to convert your sentences into vectors.

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Tokenize each line of the file into a sequence of words
val documents: RDD[Seq[String]] = sc.textFile("yourfile").map(_.split("\\s+").toSeq)

// Map each document to a fixed-size term-frequency vector via feature hashing
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)

// Re-weight term frequencies by inverse document frequency
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

// LDA expects an RDD[(Long, Vector)]: a document id paired with its vector
val corpus = tfidf.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)
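Once the model is trained, describeTopics returns, for each topic, the top term indices with their weights. Note that with HashingTF those indices are hash buckets rather than original words, so they cannot be mapped straight back to tokens. A hedged sketch of printing them (formatTopic is a hypothetical helper, not part of the Spark API):

```scala
// Hypothetical helper: render one topic's top terms and weights as a string.
def formatTopic(terms: Array[Int], weights: Array[Double]): String =
  terms.zip(weights).map { case (t, w) => f"term $t%d: $w%.3f" }.mkString(", ")

// Usage with the ldaModel fitted above (requires a running Spark context):
// ldaModel.describeTopics(maxTermsPerTopic = 5).zipWithIndex.foreach {
//   case ((terms, weights), i) => println(s"Topic $i -> " + formatTopic(terms, weights))
// }
```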

Read more about TF-IDF vectorization here.



Source: https://stackoverflow.com/questions/37869165/understanding-spark-mllib-lda-input-format
