Question
I have a table with more than 70M rows of data and about 2M duplicate rows. I want to clean up the duplicates while keeping the most recent row among each set of duplicates.
I found a few solutions here - link
However, those solutions only remove the duplicates; they do not keep the most recent row among them.
Here is another common solution:
;WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC, status DESC) AS RN
    FROM MainTable
)
DELETE FROM cte
WHERE RN > 1
But it is not supported in BigQuery.
Answer 1:
Here is a workaround that replaces the existing table with only the unique rows, keeping the most recent row among each set of duplicates.
CREATE OR REPLACE TABLE `MainTable` AS
SELECT
  id,
  acctId,
  appId,
  createdAt,
  startTime,
  subAcctId,
  type,
  updatedAt,
  userId
FROM (
  SELECT
    *,
    -- the first row among duplicates (most recent updatedAt) is kept; other rows are removed
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) AS RN
  FROM `MainTable`
)
WHERE RN = 1
Since we don't have an option here to remove a particular column (RN) from SELECT *, we have to list the required columns explicitly while replacing the existing table.
Hope this helps someone. Please share if you have any better solutions.
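As a quick sanity check after the rewrite (just a sketch, assuming `id` is the only duplicate key), a grouped count should return no rows:
SELECT
  id,
  COUNT(*) AS cnt          -- any row returned here means duplicates still exist
FROM `MainTable`
GROUP BY id
HAVING COUNT(*) > 1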
Answer 2:
Below is for BigQuery Standard SQL
CREATE OR REPLACE TABLE `MainTable` AS
SELECT * EXCEPT(RN)
FROM (
  SELECT
    *,
    -- the first row among duplicates (most recent updatedAt) is kept; other rows are removed
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) AS RN
  FROM `MainTable`
)
WHERE RN = 1
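For reference, the same result can be expressed more compactly with BigQuery's QUALIFY clause. This is only a sketch under the same assumptions (duplicates are identified by `id`, and `updatedAt` alone determines recency); the WHERE TRUE is there only because some BigQuery versions require QUALIFY to be accompanied by a WHERE, GROUP BY, or HAVING clause:
CREATE OR REPLACE TABLE `MainTable` AS
SELECT *
FROM `MainTable`
WHERE TRUE
-- keep only the most recent row per id; no RN column to strip afterwards
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) = 1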
Source: https://stackoverflow.com/questions/55831450/delete-oldest-duplicate-rows-from-a-bigquery-table