How to keep record information when working with MLlib


Question


I'm working on a classification problem in which I have to use the mllib library. The classification algorithms in mllib (say, Logistic Regression) require an RDD[LabeledPoint]. A LabeledPoint has only two fields, a label and a feature vector. When scoring (applying my trained model to the test set), my test instances have a few other fields that I'd like to keep. For example, a test instance looks like this: <id, field1, field2, label, features>. When I create an RDD of LabeledPoint, all the other fields (id, field1 and field2) are gone and I can't relate a scored instance back to the original one. How can I solve this? After scoring, I need to know the ids and the score/predicted label.

This problem doesn't exist in the ML (DataFrame-based) API, where I can simply add another column with the score to my original DataFrame.
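For reference, here is roughly what I mean with the DataFrame-based API; this is only a minimal sketch, and the column names and the trainDF/testDF DataFrames are illustrative:

import org.apache.spark.ml.classification.LogisticRegression

// trainDF/testDF are assumed to have columns: id, field1, field2, label, features
val lr = new LogisticRegression()   // uses the default "label"/"features" columns
val model = lr.fit(trainDF)
// transform appends the prediction (and probability) columns
// while keeping id, field1 and field2 untouched
val scored = model.transform(testDF)
scored.select("id", "field1", "field2", "prediction").show()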


Answer 1:


A solution to your problem is that the map method of an RDD preserves order; therefore, you can use the RDD.zip method to pair the predictions back up with the ids.

Here is an answer that shows the procedure

Spark MLLib Kmeans from dataframe, and back again

It's very easy to obtain pairs of ids and clusters in the form of an RDD:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Keep the id alongside the feature vector: (id, Vector)
val idPointRDD = data.rdd.map(s => (s.getInt(0),
    Vectors.dense(s.getDouble(1), s.getDouble(2)))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 3, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
// map preserves order, so zip re-attaches the ids to the predicted clusters
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)

Then you create a DataFrame from that (with spark.implicits._ in scope):

val idCluster = idClusterRDD.toDF("id", "cluster")

This works because map doesn't change the order of the data in the RDD, which is why you can simply zip the ids with the prediction results.
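The same pattern applies to the logistic regression scoring described in the question. Here is a rough sketch, where testRDD (the test records with their extra fields) and trainingData (an RDD[LabeledPoint] built from the training set) are assumptions you would adapt to your own data:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// testRDD: RDD of (id, field1, field2, label, features) tuples
// Keep only (id, features) for scoring
val idFeatures = testRDD.map { case (id, _, _, _, features) => (id, features) }.cache()

// trainingData: RDD[LabeledPoint], assumed to be built beforehand
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(trainingData)

// Predict on the features only; map preserves order, so zip restores the ids
val predictions = model.predict(idFeatures.map(_._2))
val idPrediction = idFeatures.map(_._1).zip(predictions)

You can then convert idPrediction to a DataFrame and join it back to the original records on id to recover field1 and field2.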



Source: https://stackoverflow.com/questions/38001318/how-to-keep-records-information-when-working-in-mllib
