Spark dataframe write method writing many small files

轮回少年 · 2020-11-27 17:34

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files; our block size is 128MB), which is approx 12
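
A job of the kind described, writing date-partitioned parquet, might look roughly like the sketch below; the paths, the date column, and the spark session are illustrative assumptions, not taken from the question.

    import org.apache.spark.sql.SaveMode

    // Hypothetical shape of the job: read the raw log files and write them out
    // partitioned by date. Without a repartition before the write, each input
    // task writes its own file into every date directory it touches, which is
    // where the flood of small output files comes from.
    val logs = spark.read.json("s3://bucket/raw-logs/")     // illustrative path

    logs.write
      .mode(SaveMode.Append)
      .partitionBy("date")                                  // assumed partition column
      .parquet("s3://bucket/logs-parquet/")                 // illustrative path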

6 Answers
  •  陌清茗 (OP) · 2020-11-27 18:22

    You have to repartition your DataFrame so that it matches the partitioning of the DataFrameWriter; otherwise every task writes its own file into each date directory it touches, which is what produces all the small files.

    Try this:

    import org.apache.spark.sql.SaveMode
    import spark.implicits._   // for the $"col" column syntax (spark is the SparkSession)

    df
      .repartition($"date")              // one shuffle partition per date value
      .write
      .mode(SaveMode.Append)
      .partitionBy("date")
      .parquet(s"$path")
    
