Spark dataset encoders: kryo() vs bean()

Question


While working with datasets in Spark, we need to specify Encoders for serializing and de-serializing objects. We have option of using Encoders.bean(Class<T>) or Encoders.kryo(Class<T>).

How are these different and what are the performance implications of using one vs another?
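For context, here is a minimal sketch of where such an encoder is actually supplied, assuming a hypothetical Person bean and a local SparkSession (none of these names come from the original question):

    import org.apache.spark.sql.{Encoders, SparkSession}

    // Hypothetical JavaBean-style class, used only for illustration;
    // Encoders.bean expects a no-arg constructor plus getters/setters.
    class Person extends Serializable {
      @scala.beans.BeanProperty var name: String = _
      @scala.beans.BeanProperty var age: Int = _
    }

    val spark = SparkSession.builder().master("local[*]").appName("encoder-usage").getOrCreate()

    val alice = new Person()
    alice.setName("Alice")
    alice.setAge(34)

    // The encoder passed here controls how Person objects are converted
    // to and from Spark's internal representation.
    val beanDs = spark.createDataset(Seq(alice))(Encoders.bean(classOf[Person]))
    val kryoDs = spark.createDataset(Seq(alice))(Encoders.kryo(classOf[Person]))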


Answer 1:


It is always advisable to prefer Kryo serialization over Java serialization, for many reasons. Some of them are listed below.

  • Kryo serialization is faster than Java serialization.
  • Kryo serialization has a smaller memory footprint, especially in cases where you need to cache() and persist() data. This is very helpful during phases like shuffling.
  • Though Kryo is supported for caching and shuffling, it is not supported when persisting to disk.
  • saveAsObjectFile on RDD and the objectFile method on SparkContext support only Java serialization.
  • The more custom data types you handle in your datasets, the more complex they become to manage. It is therefore usually best practice to use a uniform serialization approach like Kryo.
  • Java's serialization framework is notoriously inefficient, consuming too much CPU, RAM, and space to be a suitable large-scale serialization format.
  • Java serialization needs to store the fully qualified class name with every serialized object. Kryo lets you avoid this by registering the classes up front with sparkConf.registerKryoClasses(Array(classOf[A], classOf[B], ...)) or sparkConf.set("spark.kryo.registrator", "MyKryoRegistrator"), which saves a lot of space and avoids unnecessary metadata (see the sketch after this list).
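As a concrete illustration of class registration, here is a minimal sketch; the classes A and B and the app name are placeholders, not part of the original question:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Placeholder classes standing in for classOf[A], classOf[B] above.
    case class A(id: Long, name: String)
    case class B(score: Double)

    val conf = new SparkConf()
      .setAppName("kryo-registration-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes lets Kryo write a small numeric id instead of the
      // fully qualified class name for every serialized object.
      .registerKryoClasses(Array(classOf[A], classOf[B]))

    val spark = SparkSession.builder().config(conf).getOrCreate()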

The difference between bean() and javaSerialization() (and likewise kryo()) lies in how the object is represented in bytes: javaSerialization() serializes objects of type T using generic Java serialization, and kryo() does the same using Kryo; both of these encoders map T into a single byte-array (binary) column. bean(), in contrast, creates an encoder for a Java Bean of type T and maps each bean property to its own column in Spark's internal binary format, so the Dataset keeps a real schema that Spark SQL can work with.
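To make this difference visible, here is a small sketch that prints the schema each encoder produces; the Event bean is a made-up example, and the commented output is approximate:

    import org.apache.spark.sql.Encoders

    // Made-up bean class for illustration only.
    class Event {
      @scala.beans.BeanProperty var id: Long = _
      @scala.beans.BeanProperty var kind: String = _
    }

    // bean(): one column per bean property, visible to Spark SQL.
    println(Encoders.bean(classOf[Event]).schema.treeString)
    // roughly:
    // root
    //  |-- id: long (nullable = false)
    //  |-- kind: string (nullable = true)

    // kryo(): the whole object becomes a single opaque binary column.
    println(Encoders.kryo(classOf[Event]).schema.treeString)
    // roughly:
    // root
    //  |-- value: binary (nullable = true)

This is also why individual fields of a kryo-encoded Dataset cannot be referenced in Spark SQL expressions, while the columns of a bean-encoded Dataset can.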

Quoting from the documentation

JavaSerialization is extremely inefficient and should only be used as the last resort.



Source: https://stackoverflow.com/questions/50356088/spark-dataset-encoders-kryo-vs-bean
