Overwrite only some partitions in a partitioned Spark Dataset

Submitted by 三世轮回 on 2019-11-27 01:10:04

Question


How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily jobs and overwriting only last week's data.

Default Spark behaviour is to overwrite the whole table, even if only some partitions are going to be written.


Answer 1:


Since Spark 2.3.0 this is an option when overwriting a table. To overwrite only the partitions being written, set the new spark.sql.sources.partitionOverwriteMode setting to dynamic; the dataset must be partitioned and the write mode must be overwrite. Example in Scala:

spark.conf.set(
  "spark.sql.sources.partitionOverwriteMode", "dynamic"
)
data.write.mode("overwrite").insertInto("partitioned_table")

I recommend repartitioning by your partition column before writing, so you don't end up with hundreds of small files (e.g. 400) per partition folder.
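The repartition advice above could look like this in PySpark. This is only a minimal sketch: the helper name overwrite_changed_partitions is hypothetical, and it assumes an active SparkSession on 2.3.0 or later, an existing partitioned target table, and a DataFrame whose columns are in the same order as the table's (insertInto matches columns by position, not by name).

```python
def overwrite_changed_partitions(spark, df, table_name, partition_col):
    """Overwrite only the partitions of table_name that appear in df.

    Hypothetical helper, not part of Spark. `spark` is assumed to be an
    active SparkSession and `df` a DataFrame laid out like the target table.
    """
    # Replace only the partitions that receive new rows, not the whole table.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # Rows with the same partition value land in the same shuffle partition,
    # so each partition folder is typically written by a single task.
    repartitioned = df.repartition(partition_col)

    # overwrite=True is passed explicitly so PySpark does not fall back to append.
    repartitioned.write.insertInto(table_name, overwrite=True)
```

A call might look like overwrite_changed_partitions(spark, data, "partitioned_table", "date"), where "date" stands in for whatever column the table is partitioned by.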

Before Spark 2.3.0, the best solution is to run SQL statements that delete those partitions and then write them again with mode append.
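As a rough sketch of that older approach, the statements that drop the affected partitions could be generated like this. The helper name drop_partition_statements, the partition column date, and the sample values are all hypothetical; the sketch assumes a string-valued partition column and trusted input, since values are interpolated without escaping.

```python
def drop_partition_statements(table_name, partition_col, values):
    """Build one ALTER TABLE ... DROP PARTITION statement per partition value.

    Hypothetical helper: values are inserted as quoted string literals,
    so only pass trusted input.
    """
    return [
        f"ALTER TABLE {table_name} DROP IF EXISTS PARTITION ({partition_col} = '{value}')"
        for value in values
    ]


# Partitions that are about to be recomputed (sample values).
statements = drop_partition_statements(
    "partitioned_table", "date", ["2019-11-18", "2019-11-19"]
)
for statement in statements:
    print(statement)
```

With a SparkSession at hand, each statement would be run with spark.sql(statement), followed by data.write.insertInto("partitioned_table") in append mode to write the recomputed rows.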




Answer 2:


Just FYI for PySpark users: make sure to set overwrite=True in insertInto, otherwise the mode is changed to append.

From the source code:

def insertInto(self, tableName, overwrite=False):
    self._jwrite.mode(
        "overwrite" if overwrite else "append"
    ).insertInto(tableName)

This is how to use it:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC")
data.write.insertInto("partitioned_table", overwrite=True)

The SQL version also works fine:

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement

See the documentation for details.
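To make the SQL template above concrete, here is a small sketch that fills it in for a single static partition. The helper name insert_overwrite_statement, the staging table, and the id and value columns are hypothetical; note that with a static partition spec the SELECT list leaves out the partition column.

```python
def insert_overwrite_statement(table_name, partition_col, partition_value, select_sql):
    """Fill the INSERT OVERWRITE template for one static partition.

    Hypothetical helper: the value is inserted as a quoted string literal,
    so only pass trusted input.
    """
    return (
        f"INSERT OVERWRITE TABLE {table_name} "
        f"PARTITION ({partition_col} = '{partition_value}') "
        f"{select_sql}"
    )


statement = insert_overwrite_statement(
    "partitioned_table",
    "date",
    "2019-11-20",
    # The partition column is fixed by the PARTITION clause, so it is not selected.
    "SELECT id, value FROM staging WHERE date = '2019-11-20'",
)
print(statement)
```

The resulting string would then be passed to spark.sql(statement).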



Source: https://stackoverflow.com/questions/50006526/overwrite-only-some-partitions-in-a-partitioned-spark-dataset
