Overwrite only some partitions in a partitioned spark Dataset

迷失自我 2020-12-02 14:05

How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job, and only overwriting last week of d

2 answers

执念已碎 (original poster)

2020-12-02 14:06

Just FYI for PySpark users: make sure to set overwrite=True in insertInto, otherwise the write mode is changed to append.

From the source code:

def insertInto(self, tableName, overwrite=False):
    self._jwrite.mode(
        "overwrite" if overwrite else "append"
    ).insertInto(tableName)

This is how to use it:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC")
data.write.insertInto("partitioned_table", overwrite=True)

Alternatively, the SQL version works fine:

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement
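
To illustrate what the partitionOverwriteMode setting changes for an INSERT OVERWRITE without a PARTITION clause, here is a toy model in plain Python. It does not use Spark at all; the function name insert_overwrite and the dict-based "table" are hypothetical and only mimic the documented behavior of the two modes.

```python
def insert_overwrite(table, incoming, mode):
    """Toy model of INSERT OVERWRITE without a PARTITION clause.

    `table` and `incoming` map a partition value to its list of rows.
    """
    if mode == "static":
        # Static mode: every existing partition is dropped before the write.
        result = {}
    elif mode == "dynamic":
        # Dynamic mode: partitions absent from the incoming data are kept.
        result = {part: list(rows) for part, rows in table.items()}
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Partitions present in the incoming data are always replaced.
    for part, rows in incoming.items():
        result[part] = list(rows)
    return result


table = {"2020-11-30": [1], "2020-12-01": [2], "2020-12-02": [3]}
incoming = {"2020-12-01": [20]}

static_result = insert_overwrite(table, incoming, "static")
dynamic_result = insert_overwrite(table, incoming, "dynamic")
```

In this model, static mode leaves only the newly written day, while dynamic mode keeps the other two days and replaces just 2020-12-01.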

For details, see the documentation.
