Question
I want to extract the user and item features (latent factors) from the result of collaborative filtering using ALS in Spark. The code I have so far:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
// extract users latent factors
val users = model.userFeatures
// extract items latent factors
val items = model.productFeatures
// save to HDFS
users.saveAsTextFile("myhdfs/outputdirectory/users") // does not work as expected
items.saveAsTextFile("myhdfs/outputdirectory/items") // does not work as expected
However, what gets written to HDFS is not what I expect. I expected each line to contain a tuple (userId, Array_of_doubles). Instead I see the following:
[myname@host dir]$ hadoop fs -cat myhdfs/outputdirectory/users/*
(1,[D@3c3137b5)
(3,[D@505d9755)
(4,[D@241a409a)
(2,[D@c8c56dd)
...
It is dumping the array's default toString (class tag plus identity hash code) instead of the array's contents. I did the following to print the desired values:
for (user <- users) {
val (userId, lf) = user
val str = "user:" + userId + "\t" + lf.mkString(" ")
println(str)
}
This does print what I want, but it goes to the console rather than to HDFS.
What should I do to get the complete array written to HDFS properly?
Spark version is 1.2.1.
Answer 1:
@JohnTitusJungao is right, and the following lines do work as expected:
users.saveAsTextFile("myhdfs/outputdirectory/users")
items.saveAsTextFile("myhdfs/outputdirectory/items")
Here is the reason: userFeatures returns an RDD[(Int, Array[Double])]. The array values are rendered as the symbols you see in the output, e.g. [D@3c3137b5 — [D is the JVM type descriptor for "array of double", followed by @ and a hex identity hash code. This is what Java's default toString produces for array objects.
val users: RDD[(Int, Array[Double])] = model.userFeatures
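You can see the difference in isolation in any Scala REPL. This is a minimal sketch; the array values are made up for illustration:

// Array[Double] inherits java.lang.Object.toString, which prints the
// JVM type descriptor "[D" plus "@" and an identity hash code,
// while mkString joins the actual elements.
val lf = Array(0.1, 0.2, 0.3)
println(lf.toString)      // something like [D@3c3137b5 (hash varies per run)
println(lf.mkString(",")) // 0.1,0.2,0.3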
To solve that, you'll need to turn the array into a string:
val users: RDD[(Int, String)] = model.userFeatures.mapValues(_.mkString(","))
The same goes for items.
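Putting it together, here is a minimal sketch of the full save step, assuming the model and sc from the question and the same output paths:

import org.apache.spark.rdd.RDD
// Render each factor array as a comma-separated string before saving.
val userStrings: RDD[(Int, String)] = model.userFeatures.mapValues(_.mkString(","))
val itemStrings: RDD[(Int, String)] = model.productFeatures.mapValues(_.mkString(","))
userStrings.saveAsTextFile("myhdfs/outputdirectory/users")
itemStrings.saveAsTextFile("myhdfs/outputdirectory/items")
// Each output line now looks like (1,0.1,0.2,...), i.e. the tuple's toString.
// Reading the factors back later (a sketch that relies on that format):
val loadedUsers: RDD[(Int, Array[Double])] =
  sc.textFile("myhdfs/outputdirectory/users").map { line =>
    val Array(id, factors) = line.stripPrefix("(").stripSuffix(")").split(",", 2)
    (id.toInt, factors.split(",").map(_.toDouble))
  }

Note that if you only need to reload the model in Spark itself, later Spark versions (1.3+) add MatrixFactorizationModel.save and load, which may be a better fit than hand-rolled text files.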
Source: https://stackoverflow.com/questions/39008732/saving-users-and-items-features-to-hdfs-in-spark-collaborative-filtering-rdd