According to Introducing Spark Datasets:
As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom encoders – while we currently autogenerate encoders for a wide variety of types, we'd like to open up an API for custom objects.
In addition to the suggestions already given, another option I recently discovered is that you can declare your custom class by mixing in the trait org.apache.spark.sql.catalyst.DefinedByConstructorParams.
This works if the class has a constructor that uses only types the ExpressionEncoder can understand, i.e. primitive values and standard collections. It can come in handy when you can't declare the class as a case class, but don't want to use Kryo to encode it every time it's included in a Dataset.
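For instance, here's a minimal sketch (the Point class and the local SparkSession are placeholders, not part of the original example) of a plain, non-case class picking up an encoder this way. Since it isn't a Product, the ExpressionEncoder has to be supplied explicitly rather than coming from spark.implicits._:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.DefinedByConstructorParams
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// A plain class; its constructor only uses types the ExpressionEncoder understands.
class Point(val x: Double, val y: Double) extends DefinedByConstructorParams

val spark = SparkSession.builder.master("local[*]").appName("example").getOrCreate()

// Not a Product, so there is no implicit encoder from spark.implicits._; derive one explicitly.
implicit val pointEncoder: ExpressionEncoder[Point] = ExpressionEncoder[Point]()
val ds = spark.createDataset(Seq(new Point(1.0, 2.0), new Point(3.0, 4.0)))
ds.show()   // plain x and y columns, no Kryo binary blob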
For example, I wanted to declare a case class that included a Breeze vector. Normally the only encoder that could handle that would be Kryo. But if I declared a subclass that extended the Breeze DenseVector and mixed in DefinedByConstructorParams, the ExpressionEncoder understood that it could be serialized as an array of Doubles.
Here's how I declared it:
import org.apache.spark.sql.catalyst.DefinedByConstructorParams

class SerializableDenseVector(values: Array[Double]) extends breeze.linalg.DenseVector[Double](values) with DefinedByConstructorParams

// Copy the data into the serializable subclass; a plain DenseVector is not an instance of it, so a direct cast would fail at runtime.
implicit def BreezeVectorToSerializable(bv: breeze.linalg.DenseVector[Double]): SerializableDenseVector = new SerializableDenseVector(bv.toArray)
Now I can use SerializableDenseVector in a Dataset (directly, or as part of a Product) using a simple ExpressionEncoder and no Kryo. It works just like a Breeze DenseVector, but serializes as an Array[Double].
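As a quick sanity check (just a sketch; LabeledVector is an illustrative name, and this assumes the declarations above are compiled as top-level classes), you can derive the encoder for a Product that wraps the vector and inspect the schema it produces:

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Hypothetical Product type carrying the serializable vector as a field.
case class LabeledVector(label: Double, features: SerializableDenseVector)

val encoder = ExpressionEncoder[LabeledVector]()
println(encoder.schema.treeString)
// The features field should come out as array<double> rather than an opaque Kryo binary column.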