Why are Spark Parquet files for an aggregate larger than the original?

荒凉一梦 提交于 2019-11-27 09:08:43

In general columnar storage formats like Parquet are highly sensitive when it comes to data distribution (data organization) and cardinality of individual columns. The more organized is data and the lower is cardinality the more efficient is the storage.

Aggregation, as the one you apply, has to shuffle the data. When you check the execution plan you'll see it is using hash partitioner. It means that after aggregation distribution can be less efficient than the one for the original data. At the same time sum can reduce number of rows but increase cardinality for rCount column.

You can try different tools to correct for that but not all are available in Spark 1.5.2:

  • Sort complete dataset by columns having low cardinality (quite expensive due to full shuffle) or sortWithinPartitions.
  • Use partitionBy method of DataFrameWriter to partition data using low cardinality columns.
  • Use bucketBy and sortBy methods of DataFrameWriter (Spark 2.0.0+) to improve data distribution using bucketing and local sorting.
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!