I've got a fairly simple job converting log files to Parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files - our block size is 128MB), which is approx 12
The simplest solution would be to replace your current partitioning with:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.to_date
import spark.implicits._  // for the $"col" syntax

df
  .repartition(to_date($"date"))  // one shuffle partition per calendar day
  .write.mode(SaveMode.Append)
  .partitionBy("date")
  .parquet(s"$path")
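Because the DataFrame is repartitioned on the same expression the writer partitions by, all rows for a given date end up in the same task, so you should get roughly one Parquet file per date directory instead of one file per task per date.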
You can also partition the DataFrame more precisely than the writer, i.e. by day and maybe by hour (or an hour range), and then be less precise on the writer side. How fine to go really depends on the amount of data: the idea is to reduce the entropy (many small files per output directory) by repartitioning the DataFrame first and then writing with the partitionBy clause, as in the sketch below.
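A minimal sketch of that idea, assuming your log DataFrame has a timestamp column called date (the derived day and hour column names are only illustrative):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{hour, to_date}
import spark.implicits._

df
  .withColumn("day", to_date($"date"))   // illustrative derived columns
  .withColumn("hour", hour($"date"))
  .repartition($"day", $"hour")          // up to 24 tasks (and files) per day
  .write.mode(SaveMode.Append)
  .partitionBy("day")                    // writer stays coarser: one directory per day
  .parquet(s"$path")

Each day directory then holds at most 24 files, one per hour, which can be a better fit when a whole day is too much data for a single task.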