DataFrame partitionBy to a single Parquet file (per partition)

Asked by 谎友^ on 2020-12-07 13:56

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could

2 Answers
  •  南方客 · 2020-12-07 14:35

    I had the exact same problem and I found a way to do this using DataFrame.repartition(). The problem with using coalesce(1) is that your parallelism drops to 1, so the write can be slow at best and error out at worst. Increasing that number doesn't help either: with coalesce(10) you get more parallelism, but you end up with up to 10 files per output partition.
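
    For contrast, here is a minimal sketch of that coalesce-based write; the DataFrame name df, the partition columns and $location are the same placeholders used in the snippet further down:

    import org.apache.spark.sql.SaveMode

    // coalesce(1) collapses the DataFrame into a single task before the write,
    // so each partition directory gets exactly one file, but all partitioning,
    // compression and I/O runs on one core.
    df.coalesce(1)
      .write
      .partitionBy("entity", "year", "month", "day", "status")
      .mode(SaveMode.Append)
      .parquet(s"$location")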

    To get one file per partition without using coalesce(), use repartition() with the same columns you want the output to be partitioned by. So in your case, do this:

    import org.apache.spark.sql.SaveMode
    import spark.implicits._

    // Repartition by the same columns passed to partitionBy so that all rows for a
    // given output partition land in one task, producing a single file per partition.
    df.repartition($"entity", $"year", $"month", $"day", $"status")
      .write
      .partitionBy("entity", "year", "month", "day", "status")
      .mode(SaveMode.Append)
      .parquet(s"$location")
    

    Once I do that I get one parquet file per output partition, instead of multiple files.
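
    For illustration, the output then looks roughly like this; the column values are made-up examples and Spark generates the actual part-file name:

    $location/entity=acme/year=2020/month=12/day=07/status=ok/
        part-00000-<uuid>-c000.snappy.parquet    (exactly one data file per partition directory)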

    I tested this in Python, but I assume it works the same way in Scala.
