How to keep record information when working with MLlib


Question


I'm working on a classification problem in which I have to use the mllib library. The classification algorithms in mllib (say, Logistic Regression) require an RDD[LabeledPoint]. A LabeledPoint has only two fields, a label and a feature vector. When scoring (applying my trained model to the test set), my test instances have a few other fields that I'd like to keep. For example, a test instance looks like this: <id, field1, field2, label, features>. When I create an RDD of LabeledPoint, all the other fields (id, field1 and field2) are gone and I can't relate a scored instance back to the original one. How can I solve this? After scoring, I need to know the ids and the score/predicted label.

This problem doesn't exist in the ML (DataFrame-based) API, where I can simply add another column with the score to my original DataFrame.
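For reference, here is roughly what I mean with the DataFrame-based API; this is only a minimal sketch, and the column names and the trainDF/testDF DataFrames are illustrative:

import org.apache.spark.ml.classification.LogisticRegression

// trainDF/testDF are assumed to have columns: id, field1, field2, label, features
val lr = new LogisticRegression()   // uses the default "label"/"features" columns
val model = lr.fit(trainDF)
// transform appends the prediction (and probability) columns
// while keeping id, field1 and field2 untouched
val scored = model.transform(testDF)
scored.select("id", "field1", "field2", "prediction").show()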


Answer 1:


A solution to your problem is that the map method of an RDD preserves order; therefore, you can use the RDD.zip method to pair the predictions back up with the ids.

Here is an answer that shows the procedure

Spark MLLib Kmeans from dataframe, and back again

It's very easy to obtain pairs of ids and clusters in the form of an RDD:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Keep the id alongside the feature vector: (id, Vector)
val idPointRDD = data.rdd.map(s => (s.getInt(0),
    Vectors.dense(s.getDouble(1), s.getDouble(2)))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 3, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
// map preserves order, so zip re-attaches the ids to the predicted clusters
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)

Then you create a DataFrame from that (with spark.implicits._ in scope):

val idCluster = idClusterRDD.toDF("id", "cluster")

This works because map doesn't change the order of the data in the RDD, which is why you can simply zip the ids with the prediction results.
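The same pattern applies to the logistic regression scoring described in the question. Here is a rough sketch, where testRDD (the test records with their extra fields) and trainingData (an RDD[LabeledPoint] built from the training set) are assumptions you would adapt to your own data:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// testRDD: RDD of (id, field1, field2, label, features) tuples
// Keep only (id, features) for scoring
val idFeatures = testRDD.map { case (id, _, _, _, features) => (id, features) }.cache()

// trainingData: RDD[LabeledPoint], assumed to be built beforehand
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(trainingData)

// Predict on the features only; map preserves order, so zip restores the ids
val predictions = model.predict(idFeatures.map(_._2))
val idPrediction = idFeatures.map(_._1).zip(predictions)

You can then convert idPrediction to a DataFrame and join it back to the original records on id to recover field1 and field2.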



Source: https://stackoverflow.com/questions/38001318/how-to-keep-records-information-when-working-in-mllib
