overwrite hive partitions using spark

北荒 2021-02-05 21:44

I am working with AWS and I have workflows that use Spark and Hive. My data is partitioned by date, so every day I have a new partition in my S3 storage. My problem is when

4 Answers
  •  半阙折子戏
    2021-02-05 22:14

    Adding to what @wandermonk mentioned:


    Dynamic partition inserts are only supported in SQL mode (for INSERT OVERWRITE TABLE SQL statements). They are not supported for non-file-based data sources, i.e. InsertableRelations.

    With dynamic partition inserts, the behaviour of the OVERWRITE keyword is controlled by the spark.sql.sources.partitionOverwriteMode configuration property (default: static). The property controls whether Spark deletes all partitions that match the partition specification, regardless of whether any data will be written to them (static), or deletes only those partitions that will have data written into them (dynamic).
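
    To make the difference between the two modes concrete, here is a toy model in plain Python. It is not Spark code: the function name overwrite_partitions and the sample data are made up for illustration, and it assumes a fully dynamic partition specification (no static partition values), so in static mode every existing partition matches the specification.

```python
def overwrite_partitions(existing, incoming, mode="static"):
    """Toy model of INSERT OVERWRITE with a fully dynamic partition spec.

    `existing` and `incoming` map a partition value (here a date string)
    to the rows stored in that partition. Returns the table after the write.
    """
    if mode == "static":
        # Every existing partition matches the (empty) static spec, so all of
        # them are deleted; only the freshly written partitions remain.
        return dict(incoming)
    if mode == "dynamic":
        # Only partitions that receive new data are replaced; the rest survive.
        result = dict(existing)
        result.update(incoming)
        return result
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical table state and daily load (illustrative data only).
existing = {"2021-02-03": ["a"], "2021-02-04": ["b"]}
incoming = {"2021-02-04": ["b2"], "2021-02-05": ["c"]}

static_result = overwrite_partitions(existing, incoming, mode="static")
dynamic_result = overwrite_partitions(existing, incoming, mode="dynamic")

print(sorted(static_result))   # ['2021-02-04', '2021-02-05']
print(sorted(dynamic_result))  # ['2021-02-03', '2021-02-04', '2021-02-05']
```

    In the static case the 2021-02-03 partition is lost even though the load never touched it; in the dynamic case it is kept and only 2021-02-04 is replaced.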

    When the dynamic overwrite mode is enabled Spark will only delete the partitions for which it has data to be written to. All the other partitions remain intact.
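
    A minimal PySpark sketch of enabling this, assuming you already have a SparkSession with Hive support. The helper names, table name, view name, and column names are all hypothetical; only the spark.sql.sources.partitionOverwriteMode property and the INSERT OVERWRITE TABLE statement come from the text above.

```python
def build_overwrite_statement(target_table, source_view, columns, partition_column):
    """Build an INSERT OVERWRITE statement with a dynamic partition column.

    The partition column is selected last, as dynamic partition inserts expect.
    """
    select_list = ", ".join(list(columns) + [partition_column])
    return (
        f"INSERT OVERWRITE TABLE {target_table} "
        f"PARTITION ({partition_column}) "
        f"SELECT {select_list} FROM {source_view}"
    )


def overwrite_touched_partitions(spark, target_table, source_view, columns, partition_column):
    """Overwrite only the partitions of target_table that source_view has rows for.

    `spark` is assumed to be an existing SparkSession (not created here).
    """
    # Switch from the default "static" mode to "dynamic" for this session.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    statement = build_overwrite_statement(
        target_table, source_view, columns, partition_column
    )
    spark.sql(statement)
    return statement
```

    For example, overwrite_touched_partitions(spark, "events", "staged_events", ["id", "value"], "dt") would issue INSERT OVERWRITE TABLE events PARTITION (dt) SELECT id, value, dt FROM staged_events. Depending on the table type, Hive's own dynamic-partition settings (such as hive.exec.dynamic.partition.mode) may also need to be adjusted; that part is not covered here.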

    From "Writing Into Dynamic Partitions Using Spark" (https://medium.com/a-muggles-pensieve/writing-into-dynamic-partitions-using-spark-2e2b818a007a):


    Spark now writes data partitioned just as Hive would — which means only the partitions that are touched by the INSERT query get overwritten and the others are not touched.
