Question
I have a products table with the following schema:
id, createdOn, updatedOn, stock, status
createdOn and updatedOn are TIMESTAMP columns. createdOn is the partition field.
Say this is the data I have now:
id createdOn, updatedOn, stock, status
1 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
2 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 5 12
3 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
I have an ETL job that appends new rows to this table. When the ETL finishes, the same id can end up with more than one row.
For example:
id createdOn, updatedOn, stock, status
1 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
2 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 5 12
3 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
1 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
3 2018-09-14 14:14:24.305676 2018-09-15 10:00:00.000000 7 5
I want a query that runs over the table and makes sure each id has only one row: the row with the MAX(updatedOn) should stay. There can be more than one row with the MAX(updatedOn) per id; in that case they are guaranteed to be identical, because if they weren't, the updatedOn field would have been modified.
After running the query, the table should look like this:
id createdOn, updatedOn, stock, status
2 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 5 12
1 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
3 2018-09-14 14:14:24.305676 2018-09-15 10:00:00.000000 7 5
How can I write a query that performs this efficiently?
I know it should be something like:
DELETE FROM products
WHERE id NOT IN
(
SELECT MAX(id)
FROM products
GROUP BY id
)
However, this won't work: I don't have an auto-increment field to distinguish the rows.
How can I solve this?
Answer 1:
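Use a window function such as ROW_NUMBER or RANK. Note that a DELETE cannot tell identical rows apart, so the statement below only removes stale versions (rows whose updatedOn is older than the latest one for the same id); exact duplicates of the latest row are left in place:
DELETE FROM products
WHERE STRUCT(id, updatedOn) IN (
  SELECT AS STRUCT id, updatedOn
  FROM (
    -- RANK gives every row that shares the latest updatedOn rank 1
    SELECT id, updatedOn,
      RANK() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rnk
    FROM products
  )
  WHERE rnk > 1
)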
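Alternatively, rather than deleting rows, you can replace the table with a deduplicated copy that keeps exactly one row per id, the one with the latest updatedOn:
CREATE OR REPLACE TABLE products
-- Assumes daily partitioning on createdOn; repeat whatever partition spec the table already has.
PARTITION BY DATE(createdOn)
AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rn
  FROM products
)
WHERE rn = 1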
Answer 2:
I strongly recommend that you just create a new table:
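create table correct_table as
-- DISTINCT removes exact duplicates only; older versions of an id still need a MAX(updatedOn) filter
select distinct id, createdOn, updatedOn, stock, status
from etl_table;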
BigQuery's strength is processing the data. I try to find other solutions (such as copying tables) when updates or deletes seem to be needed.
You may want to re-think your processing. Just have the ETL load a table with the new rows. Then use BigQuery to insert the new rows that don't already exist. In other words, inserting the rows and then deleting them is not the way to go.
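One way to sketch the "load new rows separately, then insert only what is missing" idea is a MERGE statement. This is a rough illustration rather than a tested solution: it assumes the ETL writes its output to a staging table, here given the hypothetical name products_staging, with the same columns as products. It goes slightly beyond a pure insert by also updating a row when the staged version is newer.

```sql
-- products_staging is a hypothetical table holding only the rows from the latest ETL run.
MERGE products AS t
USING (
  -- Keep one row per id from the staging data: the most recent version.
  SELECT * EXCEPT(rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rn
    FROM products_staging
  )
  WHERE rn = 1
) AS s
ON t.id = s.id
-- The id already exists: overwrite it only if the staged row is newer.
WHEN MATCHED AND s.updatedOn > t.updatedOn THEN
  UPDATE SET updatedOn = s.updatedOn, stock = s.stock, status = s.status
-- The id is new: insert it.
WHEN NOT MATCHED THEN
  INSERT (id, createdOn, updatedOn, stock, status)
  VALUES (s.id, s.createdOn, s.updatedOn, s.stock, s.status)
```

With this pattern the products table never receives duplicate ids in the first place, so no cleanup pass is needed afterwards.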
Answer 3:
I think Gordon Linoff is right: BigQuery is not designed for constantly manipulating and updating existing rows. It is designed for loading large volumes of data and then analyzing it.
Anyway, this query would return just the rows you need:
SELECT DISTINCT id, createdOn, updatedOn, stock, status
FROM `project.dataset.maxtimestamp` AS t1
INNER JOIN (SELECT id AS i2, MAX(updatedOn) AS up
FROM `project.dataset.maxtimestamp`
GROUP BY id) AS t2
ON t1.id = t2.i2 AND t1.updatedOn = t2.up
As well as this one that you already found:
SELECT id, createdOn, updatedOn, stock, status
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY createdOn, id ORDER BY updatedOn DESC) AS rn
      FROM `training-wave-12-vmarin.asdf.duplicated_timestamp` AS t2)
WHERE rn = 1
I'm not sure how well optimized either query is, though.
Source: https://stackoverflow.com/questions/52352520/how-to-remove-duplicated-row-by-timestamp-in-bigquery