How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job, and only overwriting last week of d
Just FYI for PySpark users: make sure to set overwrite=True in insertInto; otherwise the write mode is switched to append.
From the source code:
def insertInto(self, tableName, overwrite=False):
    self._jwrite.mode(
        "overwrite" if overwrite else "append"
    ).insertInto(tableName)
This is how to use it:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC")
data.write.insertInto("partitioned_table", overwrite=True)
The SQL version also works fine:
INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement
For details, see the documentation.
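
To make the difference between the write modes concrete, here is a toy pure-Python model of what happens to the partitions. It is not Spark code: the function `insert_into`, its `partition_overwrite_mode` parameter, and the sample dates are all made up for illustration, and the static branch assumes a write with no PARTITION clause (so every existing partition counts as matched).

```python
# Toy model: a "table" is a dict mapping partition key -> list of row values.
# This only mimics the semantics described above; it does not call Spark.

def insert_into(table, new_rows, overwrite=False, partition_overwrite_mode="static"):
    """Return a new table after writing new_rows, a list of (partition, value) pairs."""
    # Group the incoming rows by partition key.
    incoming = {}
    for partition, value in new_rows:
        incoming.setdefault(partition, []).append(value)

    # Copy the existing table so the input is left untouched.
    result = {p: list(rows) for p, rows in table.items()}

    if not overwrite:
        # Append mode: existing rows stay, new rows are added next to them.
        for partition, rows in incoming.items():
            result.setdefault(partition, []).extend(rows)
        return result

    if partition_overwrite_mode.lower() == "dynamic":
        # Dynamic mode: only partitions present in the incoming data are replaced.
        result.update(incoming)
        return result

    # Static mode without a partition spec: all existing partitions are dropped first.
    return incoming


table = {"2024-01-01": [1], "2024-01-02": [2], "2024-01-03": [3]}
recomputed = [("2024-01-02", 20)]  # only one day was recomputed

appended = insert_into(table, recomputed)
static = insert_into(table, recomputed, overwrite=True)
dynamic = insert_into(table, recomputed, overwrite=True,
                      partition_overwrite_mode="DYNAMIC")
```

In this model, `appended` keeps the stale value next to the new one, `static` loses the two days that were not recomputed, and `dynamic` replaces only the recomputed day, which is the behavior the question asks for.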