Spark generates multiple small Parquet files. How can one efficiently handle a large number of small Parquet files, both in the producer and the consumer Spark jobs?
The most straightforward approach IMHO is to use repartition/coalesce (prefer coalesce unless the data is skewed and you want same-sized output files, in which case use repartition) before writing the Parquet files, so that you do not create small files in the first place.
df
  .map(...)                   // your transformations
  .filter(...)
  // ...
  .coalesce(numPartitions)    // reduce the number of output partitions, and therefore files
  .write
  .parquet(outputPath)        // outputPath: wherever the Parquet files should go
The number of partitions can be calculated from the total row count of the DataFrame divided by some factor that, through trial and error, gives you files of the proper size.
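A minimal sketch of that calculation, assuming a DataFrame df and a rowsPerFile factor that you tune by trial and error (the names and values here are mine, not part of the original answer):

val totalRows     = df.count()
val rowsPerFile   = 1000000L    // tune this factor until the output files land in the desired size range
val numPartitions = math.max(1, math.ceil(totalRows.toDouble / rowsPerFile).toInt)

df.coalesce(numPartitions)
  .write
  .parquet(outputPath)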
It is a best practice in most big data frameworks to prefer fewer large files over many small files (the file size I normally aim for is 100-500 MB).
If you already have the data in small files and want to merge them, as far as I'm aware you will have to read them with Spark, repartition to a smaller number of partitions, and write them out again, as sketched below.
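A minimal sketch of that merge step (the paths and the partition count are placeholders I chose, not from the original answer):

val smallFilesDf = spark.read.parquet("/path/to/small-files")

smallFilesDf
  .repartition(32)              // pick a count that yields files in the 100-500 MB range
  .write
  .mode("overwrite")
  .parquet("/path/to/merged")   // write to a new location, then swap/clean up the old one

coalesce(32) would also work here and avoids a full shuffle, at the cost of potentially less even file sizes.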