Spark MLlib: how to convert string categorical features into int for Rating to accept

被刻印的时光 ゝ submitted on 2019-12-04 05:33:39

Question


I want to build a recommendation application using Spark MLlib and the ALS algorithm for collaborative filtering. My data set has the user and product features as strings, like this:

[{"user":"StringName1", "product":"StringProduct1", "rating":1},
 {"user":"StringName2", "product":"StringProduct2", "rating":2},
 {"user":"StringName1", "product":"StringProduct2", "rating":3},..]

But the Rating constructor seems to accept only int values for both the user and product fields. Does that mean I will have to build a separate dictionary to map each string to an int? My dataset will have duplicate entries for both user and product. Is there a built-in solution for this in the MLlib library itself?

Thanks and any help appreciated!

Edit: No, this is not a duplicate, as the answer to that question doesn't seem to fit my scenario. The spark.ml.recommendation.ALS.Rating class doesn't seem to support String values for user or item, and I need that support.


Answer 1:


Let me try. Assume that data is an RDD[(String, String, Double)], since Rating takes its rating value as a Double:

import org.apache.spark.mllib.recommendation.Rating

// sc is the SparkContext, as provided by spark-shell
val data = sc.parallelize(Array(("StringName1", "StringProduct1", 1.0), ("StringName2", "StringProduct2", 2.0), ("StringName3", "StringProduct3", 3.0)))

// Collect the distinct names and products, sort them so the numbering is
// deterministic, and give each one a Long index. The result is a
// Map[String, Long] held on the driver.
val names = data.map(_._1).distinct.sortBy(x => x).zipWithIndex.collectAsMap
val products = data.map(_._2).distinct.sortBy(x => x).zipWithIndex.collectAsMap

// Convert to Rating format, narrowing each Long index to the Int that Rating expects
val data_rating = data.map(r => Rating(names(r._1).toInt, products(r._2).toInt, r._3))

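That should do it. Basically, zipWithIndex gives you a mapping from each distinct string to a Long, and toInt then converts that Long to the Int that Rating accepts. Duplicate entries are not a problem, because distinct keeps a single index per string. The conversion is safe as long as there are fewer than Int.MaxValue distinct users and distinct products, and both maps are small enough to collect on the driver.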
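To make the mapping step easier to check in isolation, here is a minimal sketch of the same idea on plain Scala collections, without Spark. The names IdMappingSketch, RawRating, IntRating, buildIndex and encode are hypothetical and are not part of MLlib. It also inverts the product map, which is one way you might turn integer ids from a recommendation back into the original strings.

```scala
// Plain-Scala sketch of string-to-int indexing (no Spark required).
// Every name in this object is hypothetical, chosen for illustration only.
object IdMappingSketch {
  final case class RawRating(user: String, product: String, rating: Double)
  final case class IntRating(user: Int, product: Int, rating: Double)

  // Give each distinct string an Int index based on its sorted position,
  // so the numbering is the same on every run.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  // Replace the string user and product with their indices.
  def encode(
      data: Seq[RawRating],
      userIndex: Map[String, Int],
      productIndex: Map[String, Int]
  ): Seq[IntRating] =
    data.map(r => IntRating(userIndex(r.user), productIndex(r.product), r.rating))

  def main(args: Array[String]): Unit = {
    // Same shape as the data in the question, including repeated users and products.
    val data = Seq(
      RawRating("StringName1", "StringProduct1", 1.0),
      RawRating("StringName2", "StringProduct2", 2.0),
      RawRating("StringName1", "StringProduct2", 3.0)
    )

    val userIndex = buildIndex(data.map(_.user))
    val productIndex = buildIndex(data.map(_.product))
    val encoded = encode(data, userIndex, productIndex)

    // Inverted map: integer id back to the original product string.
    val productName: Map[Int, String] = productIndex.map(_.swap)

    encoded.foreach(println)
    println(productName(1))
  }
}
```

In the Spark version above, names and products play the role of userIndex and productIndex, and swapping the keys and values of those collected maps in the same way gives the lookup from id back to string.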



Source: https://stackoverflow.com/questions/38654427/spark-mllib-how-to-convert-string-categorical-features-into-int-for-rating-to
