Spark dataframe write method writing many small files

轮回少年 2020-11-27 17:34

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files - our block size is 128MB), which is approx 12…

6 Answers
  •  攒了一身酷 2020-11-27 18:30

    The simplest solution would be to replace your current partitioning with:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.to_date
    import spark.implicits._              // for the $"col" syntax (spark is the active SparkSession)

    df
     .repartition(to_date($"date"))       // one shuffle partition per day
     .write.mode(SaveMode.Append)
     .partitionBy("date")                 // one output directory per day
     .parquet(s"$path")
    

    You can also use finer partitioning for the DataFrame itself, i.e. the day and maybe the hour within the hour range, and then use a coarser partitioning for the writer (see the sketch below). How fine to go really depends on the amount of data.

    In short, you reduce entropy by repartitioning the DataFrame first and then writing with a partitionBy clause.
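
    As a rough illustration of the finer-repartition / coarser-write idea, here is a minimal sketch. It assumes a timestamp column named ts (a hypothetical name; the original snippet uses date) and the same output path as above: the DataFrame is shuffled at (day, hour) granularity so each task holds roughly one hour of data, while the on-disk layout stays partitioned by day only.

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.{hour, to_date}
    import spark.implicits._                    // for the $"col" syntax

    df
     .withColumn("day", to_date($"ts"))         // day-level column used for the directory layout
     .repartition($"day", hour($"ts"))          // finer shuffle: whole (day, hour) groups per partition
     .write.mode(SaveMode.Append)
     .partitionBy("day")                        // coarser writer layout: one folder per day
     .parquet(s"$path")

    Because each shuffle partition holds only whole (day, hour) groups, any given day directory ends up with at most roughly 24 files, rather than one file per upstream input split.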
