SPARK 1.6.1: Task not serializable when evaluating a classifier on a DataFrame

最后都变了 - submitted on 2019-12-06 09:13:45

The source of the problem is not actually related to the DataFrame you use, or even directly to Zeppelin. It is more a matter of code organization combined with the presence of a non-serializable object in the same scope.

Since you use an interactive session, all objects are defined in the same scope and become part of the closure. That includes exprs, which looks like a Seq[Column], and Column is not serializable.

This is not a problem when you operate on SQL expressions, because exprs is used only locally, but it becomes problematic when you drop down to RDD operations: exprs is included as part of the closure and leads to an exception. The simplest way to reproduce this behavior (ColumnName is one of the subclasses of Column) is something like this:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(1, 2, 3).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: int]

scala> val x = $"x"
x: org.apache.spark.sql.ColumnName = x

scala> def f(x: Any) = 0
f: (x: Any)Int

scala> df.select(x).rdd.map(f _)
org.apache.spark.SparkException: Task not serializable
...
Caused by: java.io.NotSerializableException: org.apache.spark.sql.ColumnName
Serialization stack:
    - object not serializable (class: org.apache.spark.sql.ColumnName, value: x)
...

One way you can try to approach this problem is to mark exprs as transient:

@transient val exprs: Seq[Column] = ???

which works fine as well in our minimal example:

scala> @transient val x = $"x"
x: org.apache.spark.sql.ColumnName = x

scala> df.select(x).rdd.map(f _)
res1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[8] at map at <console>:30
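
The mechanism can also be seen without Spark, using plain Java serialization. The following is only a minimal sketch under stated assumptions: FakeColumn, PlainScope, TransientScope and canSerialize are hypothetical names invented for this illustration, with FakeColumn standing in for the non-serializable Column and the two scope classes loosely imitating the wrapper object an interactive session builds around your definitions. It is not how Spark or the REPL is actually implemented.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object TransientDemo {
  // Hypothetical stand-in for org.apache.spark.sql.Column: deliberately NOT Serializable.
  class FakeColumn(val name: String)

  // Loosely imitates an interactive-session scope: the column and the function
  // live side by side as members of the same serializable object.
  class PlainScope extends Serializable {
    val x = new FakeColumn("x")
    def f(v: Any): Int = 0
    // The function value calls this.f, so it captures `this` and, with it, `x`.
    def task: Any => Int = v => f(v)
  }

  // Same layout, but the column field is excluded from serialization.
  class TransientScope extends Serializable {
    @transient val x = new FakeColumn("x")
    def f(v: Any): Int = 0
    def task: Any => Int = v => f(v)
  }

  // Returns true if Java serialization of `obj` succeeds, false if it hits
  // a non-serializable object along the way.
  def canSerialize(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  def main(args: Array[String]): Unit = {
    println(canSerialize(new PlainScope().task))     // false
    println(canSerialize(new TransientScope().task)) // true
  }
}
```

Note that a transient field is not restored on the receiving side, so this only helps when the function never touches the field, which is the case here because the column is used only locally to build the query.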