Question:

Error:

org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

def computeRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {
  val numDistinctUsers = test_data.map(x => x.user).distinct().count()
  val userRecs: RDD[(Int, Set[Int], Set[Int])] = test_data.groupBy(testUser => testUser.user).map(u => {
    (u._1, u._2.map(p => p.product).toSet, model.recommendProducts(u._1, 20).map(prec => prec.product).toSet)
  })
  val hitsAndMiss: RDD[(Int, Double)] = userRecs.map(x => (x._1, x._2.intersect(x._3).size.toDouble))

  val hits = hitsAndMiss.map(x => x._2).sum() / numDistinctUsers

  return hits
}

I am using the recommendProducts method from MatrixFactorizationModel.scala. I have to map over the users and call that method to get the results for each one. Doing so introduces a nested mapping, which I believe causes the issue.

I know that the issue actually occurs at:

val userRecs: RDD[(Int, Set[Int], Set[Int])] = test_data.groupBy(testUser => testUser.user).map(u => {
  (u._1, u._2.map(p => p.product).toSet, model.recommendProducts(u._1, 20).map(prec => prec.product).toSet)
})

This is because I am calling model.recommendProducts while mapping over the grouped data.

Answer 1:

MatrixFactorizationModel is a distributed model (its user and product factors are themselves stored as RDDs), so you cannot simply call it from inside an action or a transformation. The closest thing to what you are doing here is something like this:

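import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

def computeRatio(model: MatrixFactorizationModel, testUsers: RDD[Rating]) = {
  // Project to (user, product) first, then group by user.
  val testData = testUsers.map(r => (r.user, r.product)).groupByKey
  // One group per user, so this is the number of distinct test users.
  val n = testData.count

  // Top 20 recommended product ids for every user known to the model.
  val recommendations = model
    .recommendProductsForUsers(20)
    .mapValues(_.map(r => r.product))

  // Join on user id and count the products present in both sets.
  val hits = testData
    .join(recommendations)
    .values
    .map{case (xs, ys) => xs.toSet.intersect(ys.toSet).size}
    .sum

  hits / n
}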
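To make the arithmetic of the join-and-intersect step concrete, here is a minimal sketch in plain Scala, with no Spark involved. The object name HitRatioExample, the function hitRatio, and the sample ids are hypothetical and exist only for illustration; ordinary Map values stand in for the two keyed RDDs, under the assumption that the join behaves like an inner join on the user id.

```scala
object HitRatioExample {
  // Hypothetical stand-ins for the two keyed RDDs: user id -> product ids.
  def hitRatio(
      testData: Map[Int, Seq[Int]],
      recommendations: Map[Int, Seq[Int]]
  ): Double = {
    // Number of distinct test users, like testData.count in the answer.
    val n = testData.size

    // Inner join on user id: users without recommendations contribute nothing.
    val hits = testData.toSeq.collect {
      case (user, products) if recommendations.contains(user) =>
        products.toSet.intersect(recommendations(user).toSet).size
    }.sum

    hits.toDouble / n
  }

  def main(args: Array[String]): Unit = {
    val testData = Map(1 -> Seq(10, 11, 12), 2 -> Seq(20, 21), 3 -> Seq(30))
    val recommendations = Map(1 -> Seq(10, 12, 99), 2 -> Seq(98, 97))
    // User 1 has 2 hits, user 2 has 0, user 3 has no recommendations.
    println(hitRatio(testData, recommendations))
  }
}
```

In this sample, user 3 is dropped by the join but still counted in n, so the result is 2 hits over 3 users. The same would apply in the Spark version to any test user for whom the model returns no recommendations.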

Notes:

distinct is an expensive operation and completely unnecessary here, since you can obtain the same information from the grouped data.
Instead of groupBy followed by a projection (map), project first and group later. There is no reason to transfer full ratings if you only want the product IDs.
