SPARK 1.6.1: Task not serializable when evaluating a classifier on a DataFrame

Submitted on 2019-12-22 12:25:40

Question


I have a DataFrame that I map into an RDD of (score, label) pairs to evaluate an SVMModel.

I am using Zeppelin and Spark 1.6.1.

Here is my code:

import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

val loadedSVMModel = SVMModel.load(sc, pathToSvmModel)

// Clear the default threshold.
loadedSVMModel.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = df.select($"features", $"label")
                       .map { case Row(features:Vector, label: Double) =>
                                val score = loadedSVMModel.predict(features)
                                (score,label)
                            }

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)

When executing the code I get an org.apache.spark.SparkException: Task not serializable, and I have a hard time understanding why this is happening and how I can fix it.

  • Is it caused by the fact that I am using Zeppelin?
  • Is it because of the original DataFrame?

I have executed the SVM example in the Spark Programming Guide, and it worked perfectly. So the reason should be related to one of the points above... I guess.

Here are some relevant elements of the exception stack:

Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
Serialization stack:
    - object not serializable (class: org.apache.spark.sql.Column, value: (sum(CASE WHEN (domainIndex = 0) THEN sumOfScores ELSE 0),mode=Complete,isDistinct=false) AS 0#100278)
    - element of array (index: 0)
    - array (class [Lorg.apache.spark.sql.Column;, size 372)

I didn't post the full exception stack, because Zeppelin tends to show very long, mostly irrelevant text. Please let me know if you want me to paste the full exception.

Additional information

The feature vectors are generated using a VectorAssembler() as follows:

// Prepare the vector assembler
val vecAssembler =  new VectorAssembler()
                               .setInputCols(arrayOfIndices)
                               .setOutputCol("features")


// Aggregation expressions
val exprs = arrayOfIndices
                .map(c => sum(when($"domainIndex" === c, $"sumOfScores")
                .otherwise(lit(0))).alias(c))

val df = vecAssembler
           .transform(anotherDF.groupBy($"userID", $"val")
           .agg(exprs.head, exprs.tail: _*))
           .select($"userID", $"features", $"val")
           .withColumn("label", sqlCreateLabelValue($"val"))
           .drop($"val").drop($"userID")

Answer 1:


The source of the problem is actually not related to the DataFrame you use, or even directly to Zeppelin. It is more a matter of code organization combined with the existence of a non-serializable object in the same scope.

Since you use an interactive session, all objects are defined in the same scope and become part of the closure. That includes exprs, which looks like a Seq[Column], and Column is not serializable.

It is not a problem when you operate on SQL expressions, because exprs is used only locally, but it becomes problematic when you drop down to RDD operations: exprs is included as part of the closure and leads to the exception. The simplest way to reproduce this behavior (ColumnName is one of the subclasses of Column) is something like this:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(1, 2, 3).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: int]

scala> val x = $"x"
x: org.apache.spark.sql.ColumnName = x

scala> def f(x: Any) = 0
f: (x: Any)Int

scala> df.select(x).rdd.map(f _)
org.apache.spark.SparkException: Task not serializable
...
Caused by: java.io.NotSerializableException: org.apache.spark.sql.ColumnName
Serialization stack:
    - object not serializable (class: org.apache.spark.sql.ColumnName, value: x)
...
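This capture behavior is not specific to Spark: a JVM closure that merely mentions a non-serializable value drags that value along when the closure itself is serialized. A minimal plain-Scala sketch, with no Spark required (all names here are illustrative, and `new Object` stands in for `exprs`):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureDemo {
  // Returns true if `obj` survives plain Java serialization,
  // which is what Spark attempts when it ships a task closure.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  // `new Object` is not java.io.Serializable -- it plays the role of exprs.
  def makeClosures(): (Int => Int, Int => Int) = {
    val notSerializable = new Object
    // Merely mentioning the value is enough: it is captured
    // as a field of the generated function object.
    val bad: Int => Int = i => { val _ = notSerializable.hashCode; i + 1 }
    // This closure captures nothing, so it serializes fine.
    val good: Int => Int = i => i + 1
    (bad, good)
  }

  def main(args: Array[String]): Unit = {
    val (bad, good) = makeClosures()
    println(isSerializable(good)) // true
    println(isSerializable(bad))  // false
  }
}
```

The `bad` closure fails to serialize even though it never uses the captured value at runtime, which mirrors how exprs poisons the task closure despite being irrelevant to the RDD map.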

One way you can try to approach this problem is to mark exprs as transient:

@transient val exprs: Seq[Column] = ???

which works fine as well in our minimal example:

scala> @transient val x = $"x"
x: org.apache.spark.sql.ColumnName = x

scala> df.select(x).rdd.map(f _)
res1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[8] at map at <console>:30
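To see why @transient helps, note that it maps onto plain Java serialization semantics: a transient field is simply skipped when the enclosing object is written out, and comes back as null after deserialization. That is harmless here because exprs is never used inside the RDD closure. A small Spark-free sketch of this mechanism (class and field names are illustrative):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// `heavy` plays the role of exprs: available while building the object
// locally, but excluded when the object is serialized. Without @transient,
// writeObject would throw, because `new Object` is not Serializable.
class Holder(h0: Object) extends Serializable {
  @transient val heavy: Object = h0 // skipped by Java serialization
  val kept: Int = 42                // serialized normally
}

object TransientDemo {
  // Serialize and deserialize, as Spark does when shipping a task closure.
  def roundTrip[A <: AnyRef](a: A): A = {
    val buf = new ByteArrayOutputStream()
    new ObjectOutputStream(buf).writeObject(a)
    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
      .readObject().asInstanceOf[A]
  }

  def main(args: Array[String]): Unit = {
    val copy = roundTrip(new Holder(new Object))
    println(copy.heavy == null) // true: the transient field did not travel
    println(copy.kept)          // 42
  }
}
```

The same trade-off applies to exprs: marking it @transient keeps it out of the serialized closure, at the cost that it would be null if anything inside the closure ever tried to read it.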


Source: https://stackoverflow.com/questions/37206108/spark-1-6-1-task-not-serializable-when-evaluating-a-classifier-on-a-dataframe
