Disclaimer: just starting to play with Spark.
I'm having trouble understanding the famous "Task not serializable" exception, but my question is a little different.
When I look inside DAGScheduler.submitMissingTasks, I see that it uses its closure serializer on my RDD, which is the Java serializer, not the Kryo serializer I'd expect.
SparkEnv supports two serializers. One, named serializer, is used for serializing your data, checkpointing, messaging between workers, etc., and is set through the spark.serializer configuration flag. The other, named closureSerializer and set through spark.closure.serializer, is used to check that your object is in fact serializable. It is configurable for Spark <= 1.6.2 (although nothing other than JavaSerializer actually works) and hardcoded to JavaSerializer from 2.0.0 onward.
The Kryo closure serializer has a bug that makes it unusable; see SPARK-7708 (this may be fixed with Kryo 3.0.0, but Spark is currently pinned to a specific version of Chill, which in turn is pinned to Kryo 2.2.1). Further, in Spark 2.0.x the JavaSerializer is hardcoded rather than configurable (you can see it in this pull request). This means that we're effectively stuck with the JavaSerializer for closure serialization.
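To make the closure check concrete, here is a minimal sketch in plain Java (no Spark involved) of what "is this closure Java-serializable?" amounts to. The names ClosureCheck, SerializableTask, Multiplier and isJavaSerializable are made up for illustration and are not Spark APIs; the assumption is only that JavaSerializer relies on standard java.io object serialization, so a captured non-serializable object fails in the same way.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class ClosureCheck {
    // Hypothetical stand-in for a function that would be shipped to executors.
    interface SerializableTask extends Serializable {
        int apply(int x);
    }

    // Deliberately NOT Serializable, like many helper objects captured by accident.
    static class Multiplier {
        final int factor;
        Multiplier(int factor) { this.factor = factor; }
    }

    // Tries to write the object with standard Java serialization and reports
    // whether it succeeded. Captured variables are written along with a lambda.
    static boolean isJavaSerializable(Object closure) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(closure);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Multiplier m = new Multiplier(3);

        // Captures the whole non-serializable Multiplier instance.
        SerializableTask bad = x -> x * m.factor;

        // Captures only a primitive copied out of it.
        int factor = m.factor;
        SerializableTask good = x -> x * factor;

        System.out.println(isJavaSerializable(bad));  // prints false
        System.out.println(isJavaSerializable(good)); // prints true
    }
}
```

Copying the needed value into a local variable before building the closure is the usual way around this kind of failure, whatever spark.serializer is set to.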
Is it weird that we use one serializer to submit tasks and another to serialize data between workers and such? Definitely, but this is what we have.
To sum up, if you set the spark.serializer configuration or use SparkConf.registerKryoClasses, you'll be using Kryo for most of your serialization in Spark. Having said that, for checking whether a given class is serializable and for serializing tasks to workers, Spark will use JavaSerializer.
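For reference, a configuration sketch using Spark's Java API, assuming spark-core is on the classpath. KryoConfSketch and MyRecord are hypothetical names used only for illustration. This affects data serialization only; closures are still checked and shipped with JavaSerializer.

```java
import org.apache.spark.SparkConf;

class KryoConfSketch {
    // Hypothetical record type standing in for one of your own data classes.
    static class MyRecord {
        int id;
        String name;
    }

    static SparkConf buildConf() {
        return new SparkConf()
            .setAppName("kryo-sketch")
            .setMaster("local[*]")
            // Use Kryo for data (shuffles, caching in serialized form, etc.).
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            // Registering classes lets Kryo avoid writing full class names.
            .registerKryoClasses(new Class<?>[]{MyRecord.class});
    }
}
```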