> I am trying to create an aggregate file for end users to utilize to avoid having them process multiple sources with much larger files. To do that I: A) iterate through all s
In general, columnar storage formats like Parquet are highly sensitive to data distribution (data organization) and to the cardinality of individual columns. The more organized the data and the lower the cardinality, the more efficient the storage.
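To make the effect concrete, here is a minimal sketch (Spark 2.x Scala API; the column names, row count, and paths are made up for illustration) that writes the same rows to Parquet twice, once sorted by a low cardinality column and once randomly ordered. The sorted copy typically ends up noticeably smaller, because run-length and dictionary encoding work best on long runs of identical values.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

val spark = SparkSession.builder().appName("parquet-ordering").getOrCreate()
import spark.implicits._

// One million rows with a low cardinality column (10 distinct values).
val df = spark.range(0, 1000000L)
  .select(($"id" % 10).as("category"), rand().as("value"))

// Sorted by the low cardinality column: long runs of identical values encode well.
df.orderBy("category").write.parquet("/tmp/parquet_sorted")

// The same rows in random order: the encoders see scattered values, so the files grow.
df.orderBy(rand()).write.parquet("/tmp/parquet_shuffled")
```

Comparing the sizes of the two output directories shows the difference directly.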
An aggregation like the one you apply has to shuffle the data. If you check the execution plan, you'll see that it uses a hash partitioner, which means that after the aggregation the distribution can be less efficient than in the original data. At the same time, `sum` can reduce the number of rows but increase the cardinality of the `rCount` column.
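For reference, this is roughly how you can see that shuffle in the plan. A minimal sketch, continuing from the `spark` session above and assuming a hypothetical grouping column `day` (your actual grouping columns may differ):

```scala
import org.apache.spark.sql.functions.sum

// Hypothetical input: a low cardinality "day" column plus an "rCount" value column.
val events = spark.range(0, 1000000L)
  .select(($"id" % 31).as("day"), ($"id" % 1000).as("rCount"))

val aggregated = events.groupBy("day").agg(sum("rCount").as("rCount"))

// The physical plan contains an Exchange (hashpartitioning) step:
// that is the shuffle introduced by the aggregation.
aggregated.explain()
```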
You can try different tools to correct for that, but not all of them are available in Spark 1.5.2 (rough sketches of each approach follow the list):

- `sortWithinPartitions` to sort the data within each partition by low cardinality columns.
- The `partitionBy` method of `DataFrameWriter` to partition the data using low cardinality columns.
- The `bucketBy` and `sortBy` methods of `DataFrameWriter` (Spark 2.0.0+) to improve data distribution using bucketing and local sorting.
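Sketches of the three options, continuing from the `aggregated` frame above and using the hypothetical `day` column as the low cardinality key; replace it with whatever columns actually have low cardinality in your data, and treat the paths, table name, and bucket count as placeholders rather than tuned settings:

```scala
import org.apache.spark.sql.SaveMode

// 1. sortWithinPartitions: local sort inside each partition before writing,
//    so Parquet sees long runs of repeated values without a full shuffle.
aggregated
  .sortWithinPartitions("day")
  .write.mode(SaveMode.Overwrite)
  .parquet("/tmp/agg_sorted_within")

// 2. partitionBy on DataFrameWriter: one output directory per value of the
//    low cardinality column, so each file holds a single value of "day".
aggregated
  .write.partitionBy("day")
  .mode(SaveMode.Overwrite)
  .parquet("/tmp/agg_partitioned")

// 3. bucketBy + sortBy (Spark 2.0.0+): bucketing with local sorting; bucketed
//    output has to be written with saveAsTable rather than a plain path.
aggregated
  .write.bucketBy(16, "day")
  .sortBy("day")
  .mode(SaveMode.Overwrite)
  .saveAsTable("agg_bucketed")
```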