Append new data to partitioned parquet files

暗喜 2021-02-01 07:01

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks). The log files are CSV, so I read them a…
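
For context, a minimal sketch of the kind of job described here, assuming hypothetical paths and that the hourly CSVs carry a header row; the partition column names follow the answer below, and on Databricks the `spark` session already exists:

    // Sketch only, not the asker's actual code. Paths and CSV options are assumptions.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("hourly-log-etl").getOrCreate()

    val logs = spark.read
      .option("header", "true")       // assumed: each hourly CSV has a header row
      .option("inferSchema", "true")  // an explicit schema would be safer in production
      .csv("/mnt/logs/2021/02/01/07/*.csv")

    // Append the new hour into the existing partitioned parquet dataset.
    // Assumes partnerID/year/month/day columns exist in (or are derived from) the CSVs.
    logs.write
      .mode("append")
      .partitionBy("partnerID", "year", "month", "day")
      .parquet("/mnt/warehouse/logs")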

2 Answers
  •  天命终不由人
    2021-02-01 07:47

    If you're using unsorted partitioning, your data will be spread across all of your Spark partitions. That means every task will generate and write data to each of your output files.

    Consider repartitioning your data by the partition columns before writing, so that all of the data for each output partition ends up on the same Spark partition:

    data
     .filter(validPartnerIds($"partnerID"))
     // An optional leading integer, e.g. .repartition(n, ...), caps the
     // number of shuffle partitions and thus the number of output files.
     .repartition($"partnerID", $"year", $"month", $"day")
     .write
     .partitionBy("partnerID", "year", "month", "day")
     .parquet(saveDestination)
    

    See: DataFrame.repartition
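
    Since the question is about appending new hourly data, a hedged sketch (same assumed names as above: data, validPartnerIds, saveDestination) of how this pattern combines with append mode; the leading integer bounds the number of shuffle partitions, and hashing on the partition columns sends each key combination to a single task, so each output directory gets few files per write:

     // Sketch only: repartition on the partition columns, then append
     // into the existing partitioned parquet dataset.
     data
      .filter(validPartnerIds($"partnerID"))
      .repartition(8, $"partnerID", $"year", $"month", $"day")
      .write
      .mode("append")                                   // keeps previously written files
      .partitionBy("partnerID", "year", "month", "day")
      .parquet(saveDestination)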
