BigQuery - removing duplicate records sometimes taking a long time

离开以前 2021-01-16 18:48

We implemented the following ETL process in the cloud: run a query in our local database hourly => save the result as a CSV and load it into Cloud Storage => load the file from Cloud Storage into a BigQuery table => remove duplicate records in BigQuery. The deduplication step sometimes takes a long time.
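
A common way to write that deduplication step (and, judging by the answer below, presumably what is used here) is a ROW_NUMBER query that keeps the newest row per key. A minimal sketch, assuming an id key and a timestamp column:

#standardSQL
-- Hypothetical reconstruction of the deduplication step: number the rows
-- within each id, newest first, and keep only the first row per id.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS rn
  FROM rawData.stock_movement
)
WHERE rn = 1;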

1 Answer
  •  天命终不由人
    2021-01-16 19:41

    It could be that you have many duplicate values for a particular id, in which case computing row numbers takes a long time. To check whether this is the case, you can try:

    #standardSQL
    SELECT id, COUNT(*) AS id_count
    FROM rawData.stock_movement
    GROUP BY id
    ORDER BY id_count DESC LIMIT 5;
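
    If a few ids account for most of the rows, those are the ids that make a window-based deduplication query slow, since all of their rows must be sorted before row numbers can be assigned.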
    

    With that said, it may be faster to remove duplicates with this query instead:

    #standardSQL
    SELECT latest_row.*
    FROM (
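      -- ARRAY_AGG over the whole row t, ordered by timestamp DESC with LIMIT 1,
      -- keeps only the newest row per id; OFFSET(0) unwraps the one-element array.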
      SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
      FROM rawData.stock_movement AS t
      GROUP BY t.id
    );
    

    Here is an example:

    #standardSQL
    WITH T AS (
      SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
      SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
      SELECT 1, 'baz', TIMESTAMP '2017-04-03')
    SELECT latest_row.*
    FROM (
      SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
      FROM T AS t
      GROUP BY t.id
    );
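
    With the sample data above, this returns two rows: (1, 'baz', 2017-04-03 00:00:00 UTC) and (2, 'bar', 2017-04-02 00:00:00 UTC); the older 'foo' row for id 1 is dropped.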
    

    The reason that this may be faster is that BigQuery only needs to keep the single row with the largest timestamp for each id in memory at any given point in time, whereas a ROW_NUMBER-based query has to buffer and sort all of an id's duplicate rows before it can assign row numbers.
