Overwrite only some partitions in a partitioned Spark Dataset

迷失自我 2020-12-02 14:05

How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job, and only overwriting last week of d

2 answers
  •  执念已碎
    2020-12-02 14:06

    Just FYI: PySpark users should make sure to set overwrite=True in insertInto, otherwise the write mode falls back to append.

    From the source code:

    def insertInto(self, tableName, overwrite=False):
        self._jwrite.mode(
            "overwrite" if overwrite else "append"
        ).insertInto(tableName)
    

    This is how to use it:

    spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC")
    data.write.insertInto("partitioned_table", overwrite=True)
    

    The SQL version also works fine:

    INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement
    

    For the documentation, look here.
