Spark dataframe write method writing many small files

轮回少年 2020-11-27 17:34

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files - our block size is 128MB), which is approx 12

6 Answers
  •  误落风尘
    2020-11-27 18:16

    Duplicating my answer from here: https://stackoverflow.com/a/53620268/171916

    This works very well for me:

    data.repartition(n, "key").write.partitionBy("key").parquet("/location")
    

    It produces N files in each output partition (directory), and is (anecdotally) faster than using coalesce and (again, anecdotally, on my data set) faster than only repartitioning on the output.
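
    For reference, here is a minimal, self-contained sketch of the same pattern; the session setup, the value of n, the column name, and the paths below are illustrative assumptions rather than details from the original answer:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("compact-parquet-write")   // hypothetical app name
      .getOrCreate()

    // hypothetical input path; "key" stands in for whatever column you partition by
    val data = spark.read.parquet("/data/raw")

    // n is the total number of shuffle partitions produced by repartition; choose it
    // so that the resulting output files land near your target size (e.g. ~128MB)
    val n = 10
    data.repartition(n, data("key"))
      .write
      .partitionBy("key")
      .parquet("/location")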

    If you're working with S3, I also recommend doing everything on local drives (Spark does a lot of file creation/rename/deletion during write-outs), and once it has all settled, using Hadoop's FileUtil (or just the AWS CLI) to copy everything over:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
    import org.apache.spark.sql.SparkSession
    // ...
      // Copies `in` (e.g. a local directory) to `out` (e.g. an s3a:// URI),
      // resolving each side's FileSystem from the job's Hadoop configuration.
      // deleteSource = false keeps the local copy in place after the transfer.
      def copy(in: String, out: String, sparkSession: SparkSession) = {
        val conf = sparkSession.sparkContext.hadoopConfiguration
        FileUtil.copy(
          FileSystem.get(new URI(in), conf),
          new Path(in),
          FileSystem.get(new URI(out), conf),
          new Path(out),
          false,   // deleteSource
          conf
        )
      }

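
    Once the local write has settled, the helper can be invoked like this (both paths and the `spark` session name are placeholders, not values from the original answer):

    // assuming `spark` is the active SparkSession; both paths are hypothetical
    copy("file:///tmp/spark-output", "s3a://my-bucket/parquet/logs", spark)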
