Spark: Dataset Serialization


Question


If I have a dataset each record of which is a case class, and I persist that dataset as shown below so that serialization is used:

myDS.persist(StorageLevel.MEMORY_ONLY_SER)

Does Spark use Java/Kryo serialization to serialize the dataset? Or, as with a DataFrame, does Spark have its own way of storing the data in the dataset?
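
For context, a minimal, self-contained version of this setup might look as follows (the Person case class and the local SparkSession are illustrative, not part of the original question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Illustrative record type standing in for the case class in the question
case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dataset-persist")
  .getOrCreate()
import spark.implicits._

val myDS = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

// Persist in serialized form; the question is how each Person is actually serialized
myDS.persist(StorageLevel.MEMORY_ONLY_SER)
myDS.count()  // materializes the cache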


Answer 1:


A Spark Dataset does not use standard serializers. Instead it uses Encoders, which "understand" the internal structure of the data and can efficiently transform objects (anything that has an Encoder, including Row) into the internal binary storage format.
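
As a rough sketch of what an Encoder does (the Person case class is hypothetical), Encoders.product derives a schema, and hence a binary layout, purely from the structure of the class, without going through a general-purpose serializer:

import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical case class; Encoders.product derives an encoder from its fields
case class Person(name: String, age: Int)

val enc: Encoder[Person] = Encoders.product[Person]
println(enc.schema)
// roughly: StructType(StructField(name,StringType,true), StructField(age,IntegerType,false))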

The only case where Kryo or Java serialization is used is when you explicitly apply Encoders.kryo[_] or Encoders.javaSerialization[_]. In any other case Spark destructures the object representation and tries to apply standard encoders (atomic encoders, the Product encoder, etc.). The only difference compared to Row is its Encoder, RowEncoder (in a sense, Encoders are similar to lenses).
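
For completeness, a sketch of that explicit opt-in (the Legacy class is hypothetical): only when the Dataset is built with such an encoder does Kryo or Java serialization come into play, and the data is then stored as a single opaque binary column:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// A hypothetical class with no Product/atomic encoder of its own
class Legacy(val payload: String) extends Serializable

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Explicit opt-in to Kryo (Encoders.javaSerialization is the Java equivalent)
implicit val legacyEnc: Encoder[Legacy] = Encoders.kryo[Legacy]

val ds = spark.createDataset(Seq(new Legacy("a"), new Legacy("b")))
ds.printSchema()
// root
//  |-- value: binary (nullable = true)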

Databricks explicitly contrasts Encoder / Dataset serialization with the Java and Kryo serializers in its post Introducing Apache Spark Datasets (see in particular the section Lightning-fast Serialization with Encoders).

Reference:

  • Michael Armbrust, Wenchen Fan, Reynold Xin and Matei Zaharia. Introducing Apache Spark Datasets, https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html



Answer 2:


Dataset[SomeCaseClass] is no different from Dataset[Row] or any other Dataset. It uses the same internal representation (mapped to instances of the external class when needed) and the same serialization method.

Therefore, there is no need for direct object serialization (Java, Kryo).
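
One way to see this (with a hypothetical case class): a Dataset of a case class and the DataFrame derived from it expose exactly the same schema, i.e. the same internal layout:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Person("Alice", 30)).toDS()  // Dataset[Person]
val df = ds.toDF()                        // Dataset[Row]

println(ds.schema == df.schema)  // true: same internal representation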




Answer 3:


Under the hood, a Dataset is backed by an RDD. From the documentation on RDD persistence:

Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

By default, Java serialization is used (from the Spark documentation on data serialization):

By default, Spark serializes objects using Java’s ObjectOutputStream framework... Spark can also use the Kryo library (version 2) to serialize objects more quickly.

To enable Kryo, initialize the job with a SparkConf and set spark.serializer to org.apache.spark.serializer.KryoSerializer:

val conf = new SparkConf()
             .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

You may need to register classes with Kryo before creating the SparkContext:

conf.registerKryoClasses(Array(classOf[Class1], classOf[Class2]))
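
With the Dataset/DataFrame API, the same configuration can be supplied when building the SparkSession; a sketch, with Class1 and Class2 as placeholder names as above:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Placeholder classes to register with Kryo
case class Class1(a: Int)
case class Class2(b: String)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[Class1], classOf[Class2]))

// SparkSession.builder accepts a pre-built SparkConf
val spark = SparkSession.builder()
  .appName("kryo-session")
  .config(conf)
  .getOrCreate()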


Source: https://stackoverflow.com/questions/47983465/spark-dataset-serialization
