Question
I have a table with more than 70M rows of data and about 2M duplicate rows. I want to clean up the duplicates while keeping the most recent row among each set of duplicates.
I found a few solutions here - link
However, those solutions only remove the duplicates; they do not keep the most recent row among them.
Here is another common solution:
;WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC, status DESC) AS RN
    FROM MainTable
)
DELETE FROM cte
WHERE RN > 1
But it is not supported in BigQuery.
Answer 1:
Here is a workaround that replaces the existing table with only the unique rows, keeping the most recent row among each set of duplicates.
CREATE OR REPLACE TABLE `MainTable` AS
SELECT
  id,
  acctId,
  appId,
  createdAt,
  startTime,
  subAcctId,
  type,
  updatedAt,
  userId
FROM (
  SELECT
    *,
    -- the first row among duplicates (most recent updatedAt) is kept; other rows are removed
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) AS RN
  FROM `MainTable`
)
WHERE RN = 1
Since we don't have an option here to remove a particular column (RN) from SELECT *, we have to list the required columns explicitly while replacing the existing table.
Hope this helps someone. Please share if you have any better solutions.
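As a quick sanity check after the rewrite (just a sketch, assuming `id` is the only duplicate key), a grouped count should return no rows:
SELECT
  id,
  COUNT(*) AS cnt          -- any row returned here means duplicates still exist
FROM `MainTable`
GROUP BY id
HAVING COUNT(*) > 1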
Answer 2:
Below is for BigQuery Standard SQL
CREATE OR REPLACE TABLE `MainTable` AS
SELECT * EXCEPT(RN)
FROM (
  SELECT
    *,
    -- the first row among duplicates (most recent updatedAt) is kept; other rows are removed
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) AS RN
  FROM `MainTable`
)
WHERE RN = 1
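For reference, the same result can be expressed more compactly with BigQuery's QUALIFY clause. This is only a sketch under the same assumptions (duplicates are identified by `id`, and `updatedAt` alone determines recency); the WHERE TRUE is there only because some BigQuery versions require QUALIFY to be accompanied by a WHERE, GROUP BY, or HAVING clause:
CREATE OR REPLACE TABLE `MainTable` AS
SELECT *
FROM `MainTable`
WHERE TRUE
-- keep only the most recent row per id; no RN column to strip afterwards
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) = 1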
Source: https://stackoverflow.com/questions/55831450/delete-oldest-duplicate-rows-from-a-bigquery-table