Spark dataframe write method writing many small files

前端未结

关注

 6  1695

轮回少年 2020-11-27 17:34

I\'ve got a fairly simple job coverting log files to parquet. It\'s processing 1.1TB of data (chunked into 64MB - 128MB files - our block size is 128MB), which is approx 12

6条回答

误落风尘 (楼主)

2020-11-27 18:16
Duplicating my answer from here: https://stackoverflow.com/a/53620268/171916

This is working for me very well:
```
data.repartition(n, "key").write.partitionBy("key").parquet("/location")
```
It produces N files in each output partition (directory), and is (anecdotally) faster than using coalesce and (again, anecdotally, on my data set) faster than only repartitioning on the output.

If you're working with S3, I also recommend doing everything on local drives (Spark does a lot of file creation/rename/deletion during write outs) and once it's all settled use hadoop FileUtil (or just the aws cli) to copy everything over:
```
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
// ...
  def copy(
          in : String,
          out : String,
          sparkSession: SparkSession
          ) = {
    FileUtil.copy(
      FileSystem.get(new URI(in), sparkSession.sparkContext.hadoopConfiguration),
      new Path(in),
      FileSystem.get(new URI(out), sparkSession.sparkContext.hadoopConfiguration),
      new Path(out),
      false,
      sparkSession.sparkContext.hadoopConfiguration
    )
  }
```
0 讨论(0)

查看其它6个回答
发布评论:

提交评论
- 加载中...