How to delete a particular month from a parquet file partitioned by month

Asked by 不思量自难忘° on 2020-12-18 14:50 · 2 answers · 1306 views

I have monthly Revenue data for the last 5 years and I am storing the DataFrames for the respective months in parquet format in append mode, but now I need to delete the data for a particular month.
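
(For context, a minimal sketch of the write pattern the question describes, assuming a 'month' column is used as the partition key and 'Revenue.parquet' as the output path; the schema and values below are illustrative only, not taken from the question.)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("revenue").getOrCreate()

    # Hypothetical DataFrame for one month; the real data would come from the monthly source.
    monthly_df = spark.createDataFrame(
        [("2015-02-01", "storeA", 1200.0)],
        ["month", "store", "revenue"],
    )

    # Append the month's data, partitioned by 'month', so each month lands in its
    # own month=<value> subdirectory under Revenue.parquet.
    monthly_df.write.partitionBy("month").mode("append").parquet("Revenue.parquet")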

2 Answers

  •  陌清茗 · 2020-12-18 15:29

    Spark supports deleting a partition, both data and metadata.
    Quoting the Scala code comment:

    /**
     * Drop Partition in ALTER TABLE: to drop a particular partition for a table.
     *
     * This removes the data and metadata for this partition.
     * The data is actually moved to the .Trash/Current directory if Trash is configured,
     * unless 'purge' is true, but the metadata is completely lost.
     * An error message will be issued if the partition does not exist, unless 'ifExists' is true.
     * Note: purge is always false when the target is a view.
     *
     * The syntax of this command is:
     * {{{
     *   ALTER TABLE table DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...] [PURGE];
     * }}}
     */
    

    In your case, there is no backing table. We could register the DataFrame as a temp table and use the above syntax (see the temp table documentation).

    From PySpark, we can run the SQL using the above syntax. Sample:

    # Load the partitioned parquet data and register it as a temp table.
    df = spark.read.format('parquet').load('Revenue.parquet')
    df.registerTempTable("tmp")
    spark.sql("ALTER TABLE tmp DROP IF EXISTS PARTITION (month='2015-02-01') PURGE")
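
    A hedged extension of the same sketch: the quoted syntax also allows several PARTITION specs in one statement, and re-reading the path is one way to check which months remain. The table and column names match the example above; nothing else here is from the original answer.

    # Illustrative only: drop two month partitions in a single statement.
    spark.sql(
        "ALTER TABLE tmp DROP IF EXISTS "
        "PARTITION (month='2015-02-01'), PARTITION (month='2015-03-01') PURGE"
    )

    # Re-read the parquet path and list the remaining months to verify.
    remaining = spark.read.parquet('Revenue.parquet')
    remaining.select('month').distinct().orderBy('month').show()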
    
