How to store custom objects in Dataset?

别那么骄傲 · asked 2020-11-22 01:53 · 9 answers · 1179 views

According to Introducing Spark Datasets:

As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom

9 Answers
  •  孤城傲影
     answered 2020-11-22 02:02

    In addition to the suggestions already given, another option I recently discovered is that you can declare your custom class so that it mixes in the trait org.apache.spark.sql.catalyst.DefinedByConstructorParams.

    This works if the class has a constructor that uses types the ExpressionEncoder can understand, i.e. primitive values and standard collections. It can come in handy when you're not able to declare the class as a case class, but don't want to use Kryo to encode it every time it's included in a Dataset.

    For example, I wanted to declare a case class that included a Breeze vector. Normally the only encoder able to handle that would be Kryo. But when I declared a subclass that extended the Breeze DenseVector and mixed in DefinedByConstructorParams, the ExpressionEncoder understood that it could be serialized as an array of Doubles.

    Here's how I declared it:

    // Subclass Breeze's DenseVector and mix in DefinedByConstructorParams so
    // the ExpressionEncoder can serialize it via its Array[Double] constructor parameter.
    class SerializableDenseVector(values: Array[Double])
      extends breeze.linalg.DenseVector[Double](values)
      with DefinedByConstructorParams

    // Wrap (rather than cast) an existing DenseVector: a plain DenseVector is not
    // actually an instance of the subclass, so asInstanceOf would fail at runtime.
    implicit def breezeVectorToSerializable(bv: breeze.linalg.DenseVector[Double]): SerializableDenseVector =
      new SerializableDenseVector(bv.toArray)

    Now I can use SerializableDenseVector in a Dataset (directly, or as part of a Product) using a simple ExpressionEncoder and no Kryo. It works just like a Breeze DenseVector but serializes as an Array[Double].
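    As a sketch of how this might look in practice (the SparkSession setup and the Embedding case class are hypothetical; this assumes the SerializableDenseVector declaration above is in scope):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // A Product type whose field is the DefinedByConstructorParams subclass;
    // the derived ExpressionEncoder serializes vec as array<double>, no Kryo.
    case class Embedding(id: Long, vec: SerializableDenseVector)

    val ds = Seq(
      Embedding(1L, new SerializableDenseVector(Array(1.0, 2.0))),
      Embedding(2L, new SerializableDenseVector(Array(3.0, 4.0)))
    ).toDS()

    // The schema should show vec as an array of doubles rather than an
    // opaque binary column, which is what a Kryo encoder would produce.
    ds.printSchema()
    ```

    With a Kryo encoder the vec column would be stored as an uninspectable binary blob; with the constructor-parameter encoding it stays a queryable array<double> column.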
