BigQuery: Deleting Duplicates in Partitioned Table

难免孤独 2020-12-15 02:00

I have a BigQuery table that is partitioned by insert time, and I'm trying to remove duplicates from it. These are true duplicates: for any two duplicate rows, all columns are equal.

2 Answers
  • 2020-12-15 02:12

    Kind of a hack, but you can use the MERGE statement to delete all of the contents of the table and reinsert only distinct rows atomically. Here's an example:

    -- Create a table with some duplicate rows: the cross join with the
    -- second array yields 10 identical copies of each of the 10 rows
    CREATE TABLE dataset.PartitionedTable
    PARTITION BY date AS
    SELECT
      x,
      CONCAT('foo', CAST(x AS STRING)) AS y,
      DATE_SUB(CURRENT_DATE(), INTERVAL x DAY) AS date
    FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x, UNNEST(GENERATE_ARRAY(1, 10));
    

    Now for the MERGE part:

    -- Execute a MERGE statement where all original rows are deleted,
    -- then replaced with new, deduplicated rows:
    MERGE dataset.PartitionedTable AS t1
    USING (SELECT DISTINCT * FROM dataset.PartitionedTable) AS t2
    ON FALSE
    WHEN NOT MATCHED BY TARGET THEN INSERT ROW
    WHEN NOT MATCHED BY SOURCE THEN DELETE
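
    To check the result, one option is to compare the total row count with the distinct row count. This is only a sketch against the example table above and is not part of the original answer. For that table, the two counts should be 100 and 10 before the MERGE, and 10 and 10 afterwards.

    ```sql
    -- Sanity check (an addition, not from the original answer):
    -- once the table is deduplicated, both counts should be equal.
    SELECT
      (SELECT COUNT(*) FROM dataset.PartitionedTable) AS total_rows,
      (SELECT COUNT(*)
       FROM (SELECT DISTINCT * FROM dataset.PartitionedTable)) AS distinct_rows;
    ```

    Note that SELECT DISTINCT * only works when every column has a type that supports grouping, so a table with ARRAY or STRUCT columns would likely need a different comparison.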
    
  • 2020-12-15 02:37

    You can do this in a single SQL MERGE statement, without creating extra tables.
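
    The script further down opens with a warning to back up the table first. The MERGE itself needs no extra table, but a backup does. A minimal sketch of one way to take it, assuming a hypothetical, not yet existing table name `the_table_backup` in the same dataset:

    ```sql
    -- Assumption: `the_table_backup` is a made-up name for illustration;
    -- pick any table name that does not exist yet.
    -- Run this before the MERGE to keep a full copy of the original data.
    CREATE TABLE `gcp_project`.`data_set`.`the_table_backup`
    COPY `gcp_project`.`data_set`.`the_table`;
    ```

    Once the deduplicated table has been verified, the backup can be dropped.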

    -- WARNING: back up the table before this operation.
    -- For large timestamp-partitioned tables.
    -- -------------------------------------------
    -- De-duplicate the rows in a given range of a partitioned table,
    -- using surrogate_key as the unique id
    -- -------------------------------------------
    
    DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
    DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
    
    MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
    USING (
      SELECT k.*
      FROM (
        SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k 
        FROM `gcp_project`.`data_set`.`the_table` AS original_data
        WHERE stamp BETWEEN dt_start AND dt_end
        GROUP BY surrogate_key
      )
    
    ) AS INTERNAL_SOURCE
    ON FALSE
    
    WHEN NOT MATCHED BY SOURCE
      AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in the partition range
        THEN DELETE
    
    WHEN NOT MATCHED THEN INSERT ROW
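
    To see how many duplicates the range contains before running the MERGE, or to confirm afterwards that none remain, a query along these lines may help. It is a sketch that reuses the table, column, and variable names from the script above and is not part of the original gist.

    ```sql
    DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles");
    DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");

    -- List the surrogate keys that occur more than once in the range.
    -- An empty result means the range holds no duplicates.
    SELECT surrogate_key, COUNT(*) AS copies
    FROM `gcp_project`.`data_set`.`the_table`
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
    HAVING COUNT(*) > 1
    ORDER BY copies DESC;
    ```

    The filter on stamp uses the same range as the MERGE, so the check covers exactly the rows the MERGE rewrites.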
    

    credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
