Multiple Spark jobs appending Parquet data to the same base path with partitioning

粉色の甜心 · 2020-12-08 01:07

I have multiple jobs that I want to execute in parallel, each appending daily data into the same base path using partitioning.

e.g.

dataFrame.write()
    .partitionBy("eventDate", "category")
    .mode(SaveMode.Append)  // org.apache.spark.sql.SaveMode
    .parquet("s3://bucket/save/path");
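
A partitioned append like this lays out key=value subdirectories under the base path; with hypothetical dates and categories the resulting tree looks like:

    s3://bucket/save/path/eventDate=2016-07-01/category=a/part-00000.parquet
    s3://bucket/save/path/eventDate=2016-07-02/category=b/part-00000.parquet
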
4 Answers
  •  予麋鹿 · 2020-12-08 01:49

    I suspect this is because of the changes to partition discovery that were introduced in Spark 1.6. The change means that Spark will only treat paths like .../xxx=yyy/ as partitions if you have specified a basePath option (see the Spark 1.6 release notes).

    So I think your problem will be solved if you add the basePath option, like this:

    dataFrame
      .write()
      .partitionBy("eventDate", "category")
      .option("basePath", "s3://bucket/save/path")
      .mode(SaveMode.Append)  // org.apache.spark.sql.SaveMode in Java
      .parquet("s3://bucket/save/path");
    

    (I haven't had the chance to verify it, but hopefully it will do the trick :))
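
    Worth noting: the Spark documentation describes basePath as a read-side option for partition discovery, so the same option comes up again when you read a single partition directory back and still want the partition columns. A minimal sketch, assuming a Spark 2.x SparkSession named spark and a hypothetical partition value (on 1.6 it would be sqlContext.read()):

        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;

        // basePath tells Spark where the partition tree starts, so
        // eventDate and category are still recovered as columns even
        // though we point at a single partition directory.
        Dataset<Row> day = spark.read()
            .option("basePath", "s3://bucket/save/path")
            .parquet("s3://bucket/save/path/eventDate=2016-07-01"); // hypothetical date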
