I've got a fairly simple job converting log files to Parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files - our block size is 128MB), which is approx 12
The simplest solution would be to replace your current partitioning with:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.to_date
import spark.implicits._  // for the $"col" syntax

df
  .repartition(to_date($"date"))  // one shuffle partition per calendar day
  .write.mode(SaveMode.Append)
  .partitionBy("date")
  .parquet(s"$path")
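Because the DataFrame is repartitioned on the same expression the writer partitions by, all rows for a given date end up in the same task, so you should get roughly one Parquet file per date directory instead of one file per task per date.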
You can also partition the DataFrame more precisely than the writer, i.e. by day and maybe by hour (or an hour range), and then be less precise on the writer side. How fine to go really depends on the amount of data: the idea is to reduce the entropy (many small files per output directory) by repartitioning the DataFrame first and then writing with the partitionBy clause, as in the sketch below.
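A minimal sketch of that idea, assuming your log DataFrame has a timestamp column called date (the derived day and hour column names are only illustrative):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{hour, to_date}
import spark.implicits._

df
  .withColumn("day", to_date($"date"))   // illustrative derived columns
  .withColumn("hour", hour($"date"))
  .repartition($"day", $"hour")          // up to 24 tasks (and files) per day
  .write.mode(SaveMode.Append)
  .partitionBy("day")                    // writer stays coarser: one directory per day
  .parquet(s"$path")

Each day directory then holds at most 24 files, one per hour, which can be a better fit when a whole day is too much data for a single task.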