Is gzipped Parquet file splittable in HDFS for Spark?

Submitted by 假装没事ソ on 2019-12-04 11:32:57

Question


I get conflicting answers when searching and reading about this subject on the internet. Can anyone share their experience? I know for a fact that a gzipped CSV is not splittable, but perhaps the internal file structure of Parquet makes it a totally different case for Parquet than for CSV?


Answer 1:


Parquet files with GZIP compression are actually splittable. This is because of the internal layout of Parquet files, which are always splittable, independent of the compression algorithm used.

This is mainly because Parquet files are divided into the following parts:

  1. Each Parquet file consists of several RowGroups; these should be roughly the same size as your HDFS block size (see the write sketch after this list).
  2. Each RowGroup consists of one ColumnChunk per column. Every ColumnChunk in a RowGroup has the same number of rows.
  3. ColumnChunks are split into Pages, which are typically between 64 KiB and 16 MiB in size. Compression is applied per page, so a page is the lowest level of parallelisation a job can work on.
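
For illustration, here is a minimal PySpark sketch (not part of the original answer) of writing a gzip-compressed Parquet file while aligning the row-group size with the HDFS block size. The output path and the 128 MiB figure are assumptions; adjust them to your cluster's dfs.blocksize.

```python
from pyspark.sql import SparkSession

# Match the Parquet row-group size (parquet.block.size) to the HDFS block size
# so that each row group maps cleanly onto one HDFS block; 128 MiB is a common
# default for both, but check your cluster's dfs.blocksize.
spark = (
    SparkSession.builder
    .appName("write-gzip-parquet")
    .config("spark.hadoop.parquet.block.size", 128 * 1024 * 1024)
    .getOrCreate()
)

df = spark.range(0, 100_000_000).withColumnRenamed("id", "value")

# GZIP here compresses the pages inside each column chunk; the file-level
# structure (row groups, footer) stays intact, so the file remains splittable.
(
    df.write
    .option("compression", "gzip")
    .mode("overwrite")
    .parquet("hdfs:///tmp/demo_gzip_parquet")  # hypothetical output path
)
```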

You can find a more detailed explanation here: https://github.com/apache/parquet-format#file-format
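
As a rough way to observe this in practice, the following sketch (assuming the hypothetical path written above) reads the gzip-compressed Parquet back and inspects how many input partitions Spark creates; if the file were not splittable, a large file would come back as a single partition.

```python
# Read the gzip-compressed Parquet back and inspect the parallelism Spark
# derives from it. Because Parquet is splittable regardless of the page
# compression codec, a file larger than spark.sql.files.maxPartitionBytes
# (128 MiB by default) is read as multiple input partitions.
df = spark.read.parquet("hdfs:///tmp/demo_gzip_parquet")  # hypothetical path
print("number of input partitions:", df.rdd.getNumPartitions())
```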



Source: https://stackoverflow.com/questions/43323882/is-gzipped-parquet-file-splittable-in-hdfs-for-spark
