> I am trying to create an aggregate file for end users to utilize to avoid having them process multiple sources with much larger files. To do that I: A) iterate through all s
In general, columnar storage formats like Parquet are highly sensitive to data distribution (data organization) and to the cardinality of individual columns. The more organized the data and the lower the cardinality, the more efficient the storage.
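To make the effect concrete, here is a minimal sketch (Spark 2.x Scala API; the column names, row count, and paths are made up for illustration) that writes the same rows to Parquet twice, once sorted by a low cardinality column and once randomly ordered. The sorted copy typically ends up noticeably smaller, because run-length and dictionary encoding work best on long runs of identical values.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

val spark = SparkSession.builder().appName("parquet-ordering").getOrCreate()
import spark.implicits._

// One million rows with a low cardinality column (10 distinct values).
val df = spark.range(0, 1000000L)
  .select(($"id" % 10).as("category"), rand().as("value"))

// Sorted by the low cardinality column: long runs of identical values encode well.
df.orderBy("category").write.parquet("/tmp/parquet_sorted")

// The same rows in random order: the encoders see scattered values, so the files grow.
df.orderBy(rand()).write.parquet("/tmp/parquet_shuffled")
```

Comparing the sizes of the two output directories shows the difference directly.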
An aggregation like the one you apply has to shuffle the data. If you check the execution plan, you'll see that it uses a hash partitioner, which means that after the aggregation the distribution can be less efficient than in the original data. At the same time, `sum` can reduce the number of rows but increase the cardinality of the `rCount` column.
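For reference, this is roughly how you can see that shuffle in the plan. A minimal sketch, continuing from the `spark` session above and assuming a hypothetical grouping column `day` (your actual grouping columns may differ):

```scala
import org.apache.spark.sql.functions.sum

// Hypothetical input: a low cardinality "day" column plus an "rCount" value column.
val events = spark.range(0, 1000000L)
  .select(($"id" % 31).as("day"), ($"id" % 1000).as("rCount"))

val aggregated = events.groupBy("day").agg(sum("rCount").as("rCount"))

// The physical plan contains an Exchange (hashpartitioning) step:
// that is the shuffle introduced by the aggregation.
aggregated.explain()
```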
You can try different tools to correct for that, but not all of them are available in Spark 1.5.2 (rough sketches of each approach follow the list):

- `sortWithinPartitions` to sort the data within each partition by low cardinality columns.
- The `partitionBy` method of `DataFrameWriter` to partition the data using low cardinality columns.
- The `bucketBy` and `sortBy` methods of `DataFrameWriter` (Spark 2.0.0+) to improve data distribution using bucketing and local sorting.
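Sketches of the three options, continuing from the `aggregated` frame above and using the hypothetical `day` column as the low cardinality key; replace it with whatever columns actually have low cardinality in your data, and treat the paths, table name, and bucket count as placeholders rather than tuned settings:

```scala
import org.apache.spark.sql.SaveMode

// 1. sortWithinPartitions: local sort inside each partition before writing,
//    so Parquet sees long runs of repeated values without a full shuffle.
aggregated
  .sortWithinPartitions("day")
  .write.mode(SaveMode.Overwrite)
  .parquet("/tmp/agg_sorted_within")

// 2. partitionBy on DataFrameWriter: one output directory per value of the
//    low cardinality column, so each file holds a single value of "day".
aggregated
  .write.partitionBy("day")
  .mode(SaveMode.Overwrite)
  .parquet("/tmp/agg_partitioned")

// 3. bucketBy + sortBy (Spark 2.0.0+): bucketing with local sorting; bucketed
//    output has to be written with saveAsTable rather than a plain path.
aggregated
  .write.bucketBy(16, "day")
  .sortBy("day")
  .mode(SaveMode.Overwrite)
  .saveAsTable("agg_bucketed")
```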