How to remove duplicated row by timestamp in BigQuery?

Question

I have a products table with the following schema:

id, createdOn, updatedOn, stock, status

createdOn and updatedOn are of type TIMESTAMP.

createdOn is the partition field.

Say this is the data I have now:

id  createdOn                   updatedOn                   stock  status
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
2   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  5      12
3   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5

I have an ETL job that appends new rows to this table. When the ETL finishes, the same id can end up with more than one row.

For example:

id  createdOn                   updatedOn                   stock  status
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
2   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  5      12
3   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
3   2018-09-14 14:14:24.305676  2018-09-15 10:00:00.000000  7      5

I want a query that runs over the table and makes sure that each id has only one row: the row with MAX(updatedOn) should stay. There can be more than one row with MAX(updatedOn) per id; in that case they are guaranteed to be identical, because if they weren't, then the updatedOn field would have been modified.

After running the query, the table should look like this:

id  createdOn                   updatedOn                   stock  status
2   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  5      12
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
3   2018-09-14 14:14:24.305676  2018-09-15 10:00:00.000000  7      5

How can I write a query that does this efficiently?

I know it should be something like:

DELETE FROM products
WHERE id NOT IN
(
    SELECT MAX(id)
    FROM products
    GROUP BY id
)

However, this won't work: I don't have an auto-increment field to distinguish the rows.

How can I solve this?

Answer 1:

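Use a window function to rank the rows of each id by updatedOn. In a DELETE, use RANK() rather than ROW_NUMBER(): identical rows cannot be told apart in the WHERE clause, so this removes only the rows older than the latest updatedOn and leaves exact duplicates of the latest row in place.

DELETE FROM products
WHERE STRUCT(id, updatedOn) IN (
  -- (id, updatedOn) pairs that are older than the latest updatedOn for that id.
  -- RANK() gives tied rows the same rank, so copies of the latest row are never matched.
  SELECT AS STRUCT id, updatedOn
  FROM (
    SELECT id, updatedOn,
           RANK() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rk
    FROM products
  ) t
  WHERE rk > 1
)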

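Alternatively, rather than deleting rows, you can replace the table with its deduplicated contents. This keeps exactly one row per id, including when the latest row is duplicated:

CREATE OR REPLACE TABLE products AS
SELECT * EXCEPT(rn)
FROM (
  -- rn = 1 marks the row with the latest updatedOn for each id
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rn
  FROM products
)
WHERE rn = 1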
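
Because the question says createdOn is the partition field, a statement that replaces the table needs to restate the partitioning. The sketch below assumes daily partitioning on createdOn, which the question does not confirm; adjust the PARTITION BY expression to match the actual table definition.

```sql
-- Assumption: products is partitioned by day on the TIMESTAMP column createdOn.
CREATE OR REPLACE TABLE products
PARTITION BY DATE(createdOn)
AS
SELECT * EXCEPT(rn)
FROM (
  -- Number the rows of each id, newest updatedOn first.
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rn
  FROM products
)
-- Keep only the newest row per id; identical copies of it are dropped as well.
WHERE rn = 1
```

Rewriting the table this way reads and rewrites every row, so it is probably best run once after each ETL load rather than per batch.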

Answer 2:

I strongly recommend that you just create a new table:

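create table correct_table as
    -- distinct drops exact duplicate rows only; older versions of an id still need to be filtered out
    select distinct id, createdOn, updatedOn, stock, status
    from etl_table;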

BigQuery's strength is processing the data. I try to find other solutions (such as copying tables) when updates or deletes seem to be needed.

You may want to re-think your processing. Just have the ETL load a table with the new rows. Then use BigQuery to insert the new rows that don't already exist. In other words, inserting the rows and then deleting them is not the way to go.
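
One way to sketch that workflow is a MERGE from a staging table. The name products_staging is hypothetical and stands for a table that holds only the rows from the latest ETL run, with the same columns as products. Updating matched rows goes a step beyond "insert the rows that don't already exist", so treat the WHEN MATCHED branch as optional.

```sql
-- Assumption: products_staging (hypothetical name) has the same schema as products.
MERGE products AS t
USING (
  -- Keep one row per id from the staging data, preferring the newest updatedOn,
  -- so that each target row matches at most one source row.
  SELECT * EXCEPT(rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rn
    FROM products_staging
  )
  WHERE rn = 1
) AS s
ON t.id = s.id
-- An existing id is refreshed only when the staged row is newer.
WHEN MATCHED AND s.updatedOn > t.updatedOn THEN
  UPDATE SET updatedOn = s.updatedOn, stock = s.stock, status = s.status
-- An id that is not in products yet is inserted as-is.
WHEN NOT MATCHED THEN
  INSERT ROW
```

With this shape the products table never accumulates duplicates in the first place, so no cleanup query is needed afterwards.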

Answer 3:

I think Gordon Linoff is right: BigQuery's use case is not to manipulate data and update existing rows all the time. It is meant to be loaded with large volumes of data that you then analyze.

Anyway, this query would return just the rows you need:

SELECT DISTINCT id, createdOn, updatedOn, stock, status
FROM `project.dataset.maxtimestamp` AS t1
INNER JOIN (
  SELECT id AS i2, MAX(updatedOn) AS up
  FROM `project.dataset.maxtimestamp`
  GROUP BY id
) AS t2
ON t1.id = t2.i2 AND t1.updatedOn = t2.up

So would this one, which you already found:

SELECT id, createdOn, updatedOn, stock, status
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rn
      FROM `training-wave-12-vmarin.asdf.duplicated_timestamp` AS t2)
WHERE rn = 1  -- keep only the latest row per id

That said, I'm not sure how well optimized either query is.
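
On the optimization question, a pattern often suggested for BigQuery is ARRAY_AGG with ORDER BY and LIMIT 1, which only has to retain one row per id instead of numbering every row. The following is a minimal sketch that has not been benchmarked here; it reuses the `project.dataset.maxtimestamp` table name from the first query.

```sql
SELECT latest.*
FROM (
  -- For each id, aggregate whole rows and keep only the one with the newest updatedOn.
  SELECT ARRAY_AGG(t ORDER BY t.updatedOn DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM `project.dataset.maxtimestamp` AS t
  GROUP BY t.id
)
```

It returns one row per id with the original columns, so it can be dropped into a CREATE TABLE ... AS SELECT in place of the ROW_NUMBER() version.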

Source: https://stackoverflow.com/questions/52352520/how-to-remove-duplicated-row-by-timestamp-in-bigquery

Tags

sql

google-bigquery