According to Introducing Spark Datasets:
As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom encoders – while we currently autogenerate encoders for a wide variety of types, we'd like to open up an API for custom objects.
In addition to the suggestions already given, another option I recently discovered is that you can declare your custom class by mixing in the trait org.apache.spark.sql.catalyst.DefinedByConstructorParams.
This works if the class has a constructor that uses only types the ExpressionEncoder can understand, i.e. primitive values and standard collections. It can come in handy when you can't declare the class as a case class, but don't want to use Kryo to encode it every time it's included in a Dataset.
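For instance, here's a minimal sketch (the Point class and the local SparkSession are placeholders, not part of the original example) of a plain, non-case class picking up an encoder this way. Since it isn't a Product, the ExpressionEncoder has to be supplied explicitly rather than coming from spark.implicits._:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.DefinedByConstructorParams
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// A plain class; its constructor only uses types the ExpressionEncoder understands.
class Point(val x: Double, val y: Double) extends DefinedByConstructorParams

val spark = SparkSession.builder.master("local[*]").appName("example").getOrCreate()

// Not a Product, so there is no implicit encoder from spark.implicits._; derive one explicitly.
implicit val pointEncoder: ExpressionEncoder[Point] = ExpressionEncoder[Point]()
val ds = spark.createDataset(Seq(new Point(1.0, 2.0), new Point(3.0, 4.0)))
ds.show()   // plain x and y columns, no Kryo binary blob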
For example, I wanted to declare a case class that included a Breeze vector. Normally the only encoder that could handle that would be Kryo. But if I declared a subclass that extended the Breeze DenseVector and mixed in DefinedByConstructorParams, the ExpressionEncoder understood that it could be serialized as an array of Doubles.
Here's how I declared it:
import org.apache.spark.sql.catalyst.DefinedByConstructorParams

class SerializableDenseVector(values: Array[Double]) extends breeze.linalg.DenseVector[Double](values) with DefinedByConstructorParams

// Copy the data into the serializable subclass; a plain DenseVector is not an instance of it, so a direct cast would fail at runtime.
implicit def BreezeVectorToSerializable(bv: breeze.linalg.DenseVector[Double]): SerializableDenseVector = new SerializableDenseVector(bv.toArray)
Now I can use SerializableDenseVector in a Dataset (directly, or as part of a Product) using a simple ExpressionEncoder and no Kryo. It works just like a Breeze DenseVector, but serializes as an Array[Double].
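As a quick sanity check (just a sketch; LabeledVector is an illustrative name, and this assumes the declarations above are compiled as top-level classes), you can derive the encoder for a Product that wraps the vector and inspect the schema it produces:

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Hypothetical Product type carrying the serializable vector as a field.
case class LabeledVector(label: Double, features: SerializableDenseVector)

val encoder = ExpressionEncoder[LabeledVector]()
println(encoder.schema.treeString)
// The features field should come out as array<double> rather than an opaque Kryo binary column.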