BigQuery - removing duplicate records sometimes taking a long time

离开以前 2021-01-16 18:48

We implemented the following ETL process in the cloud: run a query in our local database hourly => save the result as a CSV and load it into Cloud Storage => load the file from Cloud Storage into a BigQuery table => remove duplicate records in BigQuery. The deduplication step sometimes takes a long time.
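
A common way to write that deduplication step (and, judging by the answer below, presumably what is used here) is a ROW_NUMBER query that keeps the newest row per key. A minimal sketch, assuming an id key and a timestamp column:

#standardSQL
-- Hypothetical reconstruction of the deduplication step: number the rows
-- within each id, newest first, and keep only the first row per id.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS rn
  FROM rawData.stock_movement
)
WHERE rn = 1;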

1 Answer
  •  天命终不由人
    2021-01-16 19:41

    It could be that you have many duplicate values for a particular id, in which case computing row numbers takes a long time. To check whether this is the case, you can try:

    #standardSQL
    SELECT id, COUNT(*) AS id_count
    FROM rawData.stock_movement
    GROUP BY id
    ORDER BY id_count DESC LIMIT 5;
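
    If a few ids account for most of the rows, those are the ids that make a window-based deduplication query slow, since all of their rows must be sorted before row numbers can be assigned.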
    

    With that said, it may be faster to remove duplicates with this query instead:

    #standardSQL
    SELECT latest_row.*
    FROM (
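      -- ARRAY_AGG over the whole row t, ordered by timestamp DESC with LIMIT 1,
      -- keeps only the newest row per id; OFFSET(0) unwraps the one-element array.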
      SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
      FROM rawData.stock_movement AS t
      GROUP BY t.id
    );
    

    Here is an example:

    #standardSQL
    WITH T AS (
      SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
      SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
      SELECT 1, 'baz', TIMESTAMP '2017-04-03')
    SELECT latest_row.*
    FROM (
      SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
      FROM T AS t
      GROUP BY t.id
    );
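
    With the sample data above, this returns two rows: (1, 'baz', 2017-04-03 00:00:00 UTC) and (2, 'bar', 2017-04-02 00:00:00 UTC); the older 'foo' row for id 1 is dropped.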
    

    The reason that this may be faster is that BigQuery only needs to keep the single row with the largest timestamp for each id in memory at any given point in time, whereas a ROW_NUMBER-based query has to buffer and sort all of an id's duplicate rows before it can assign row numbers.
