Spark dataframe write method writing many small files

轮回少年 2020-11-27 17:34

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files - our block size is 128MB), which is approx 12

6 Answers
  •  被撕碎了的回忆
    2020-11-27 18:24

    I came across the same issue, and using coalesce solved my problem.

    import org.apache.spark.sql.SaveMode

    df
      .coalesce(3) // reduce to 3 partitions, so 3 output files
      .write.mode(SaveMode.Append)
      .parquet(s"$path")


    For more information on using coalesce or repartition, you can refer to the following: Spark: coalesce or repartition.
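
    As a rough sketch of the difference between the two (assuming a SparkSession and the hypothetical paths below, which are not from the question): coalesce merges existing partitions without a full shuffle, which is cheap but can leave unevenly sized files, while repartition shuffles the data and produces evenly sized partitions at a higher cost.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("small-files-demo").getOrCreate()
    val df = spark.read.parquet("/tmp/input")   // hypothetical input path

    // coalesce: narrow transformation, merges partitions without a full shuffle.
    // Cheap, but partition (and therefore file) sizes may be skewed.
    df.coalesce(3)
      .write.mode(SaveMode.Append)
      .parquet("/tmp/output-coalesce")

    // repartition: full shuffle, produces evenly sized partitions.
    // More expensive, but the output files are balanced.
    df.repartition(3)
      .write.mode(SaveMode.Append)
      .parquet("/tmp/output-repartition")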
