Saving user and item features to HDFS in Spark collaborative filtering (RDD)

Submitted by 心不动则不痛 on 2019-12-11 04:16:11

Question


I want to extract the user and item features (latent factors) from the result of collaborative filtering using ALS in Spark. The code I have so far:

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)

// extract users latent factors
val users = model.userFeatures

// extract items latent factors
val items = model.productFeatures

// save to HDFS
users.saveAsTextFile("myhdfs/outputdirectory/users") // does not work as expected
items.saveAsTextFile("myhdfs/outputdirectory/items") // does not work as expected

However, what gets written to HDFS is not what I expected. I expected each line to contain a tuple (userId, Array_of_doubles). Instead I see the following:

[myname@host dir]$ hadoop fs -cat myhdfs/outputdirectory/users/*
(1,[D@3c3137b5)
(3,[D@505d9755)
(4,[D@241a409a)
(2,[D@c8c56dd)
...

It is dumping the hash code of the array instead of its contents. I did the following to print the desired values:

for (user <- users) {
  val (userId, lf) = user
  val str = "user:" + userId + "\t" + lf.mkString(" ")
  println(str)
}

This does print what I want, but I can't then write it to HDFS (it only prints to the console).

What should I do to get the complete array written to HDFS properly?

Spark version is 1.2.1.


Answer 1:


@JohnTitusJungao is right, and the following lines also work as expected:

users.saveAsTextFile("myhdfs/outputdirectory/users") 
items.saveAsTextFile("myhdfs/outputdirectory/items")

Here's why: userFeatures returns an RDD[(Int, Array[Double])]. The arrays are rendered as the symbols you see in the output, e.g. [D@3c3137b5: [D stands for "array of double", followed by @ and a hex identity hash code. That is what Java's default toString produces for array objects. More on that here.

val users: RDD[(Int, Array[Double])] = model.userFeatures
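You can reproduce this behavior with plain Scala arrays, entirely outside Spark (a minimal sketch with made-up factor values):

```scala
object ArrayToStringDemo {
  def main(args: Array[String]): Unit = {
    val factors = Array(0.12, -0.34, 0.56)

    // Arrays inherit java.lang.Object.toString, so printing one yields
    // "[D@" plus an identity hash code in hex, not the elements.
    println((1, factors))                // e.g. (1,[D@3c3137b5)

    // mkString renders the actual elements instead.
    println((1, factors.mkString(" "))) // (1,0.12 -0.34 0.56)
  }
}
```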

To fix that, convert the array to a string before saving:

val users: RDD[(Int, String)] = model.userFeatures.mapValues(_.mkString(","))

The same goes for items.
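If you later need to load the factors back from HDFS, each saved line has the form (id,v1,v2,...), which you can parse again. A minimal Spark-free sketch of such a parser (the line format is assumed from the mapValues output above; the object name is hypothetical):

```scala
object FactorLineParser {
  // Parse a saved line like "(1,0.12,-0.34,0.56)" back into (id, factors).
  def parse(line: String): (Int, Array[Double]) = {
    val body = line.stripPrefix("(").stripSuffix(")")
    val Array(id, rest) = body.split(",", 2) // split on the first comma only
    (id.toInt, rest.split(",").map(_.toDouble))
  }

  def main(args: Array[String]): Unit = {
    val (id, factors) = parse("(1,0.12,-0.34,0.56)")
    println(s"user:$id\tfactors:${factors.mkString(" ")}")
  }
}
```

In Spark you would apply this per line, e.g. sc.textFile("myhdfs/outputdirectory/users").map(FactorLineParser.parse).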



Source: https://stackoverflow.com/questions/39008732/saving-users-and-items-features-to-hdfs-in-spark-collaborative-filtering-rdd
