Reject data load attempt to BigQuery for existing data

温柔的废话 2021-01-17 02:17

I'm loading data from pandas DataFrames to BigQuery using the pandas-gbq package:

df.to_gbq('dataset.table', project_id


        
1 Answer
  • 2021-01-17 02:53

    Is there a way to reject the loading attempt if the key already appears in the BigQuery table?

    No, because BigQuery does not enforce unique keys the way other databases do. There are two typical approaches to this problem:

    Option 1:
    Upload the data with a timestamp and use a MERGE statement to remove duplicates.

    See this link on how to do this. Here is an example:

    MERGE `DATA` AS target
    USING `DATA` AS source
    ON target.key = source.key
    WHEN MATCHED AND target.ts < source.ts THEN 
    DELETE
    

    Note: In this case, you pay for the data scanned by the MERGE, but the keys in your table stay unique.
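To make the effect of the MERGE condition concrete, here is a minimal in-memory sketch in plain Python. It does not talk to BigQuery; it only imitates the rule "delete a row when another row with the same key has a later ts". The function name `drop_superseded` and the sample rows are assumptions made up for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def drop_superseded(rows: List[Row]) -> List[Row]:
    """Keep a row only if no other row with the same key has a later ts."""
    # Find the latest timestamp seen for each key.
    latest: Dict[str, str] = {}
    for row in rows:
        key, ts = row["key"], row["ts"]
        if key not in latest or ts > latest[key]:
            latest[key] = ts
    # A row survives when its ts equals the latest ts for its key, which
    # imitates "WHEN MATCHED AND target.ts < source.ts THEN DELETE".
    return [row for row in rows if row["ts"] == latest[row["key"]]]


# Hypothetical sample rows; timestamps are strings in a sortable format.
rows = [
    {"key": "sd3e", "value": 0.3, "ts": "2019-04-14 00:00:00"},
    {"key": "sd3e", "value": 0.2, "ts": "2019-04-14 01:00:00"},
    {"key": "sd4r", "value": 0.1, "ts": "2019-04-14 00:00:00"},
    {"key": "sd4r", "value": 0.5, "ts": "2019-04-14 01:00:00"},
]
remaining = drop_superseded(rows)
```

Because the comparison is strict, two rows with the same key and the same timestamp both survive, so this rule alone only guarantees uniqueness when timestamps differ per key.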

    Option 2:

    Upload the data with a timestamp and use the ROW_NUMBER window function to fetch the latest record. Here is an example with your data:

    WITH DATA AS (
        SELECT 'sd3e' AS key, 0.3 as value,  1 as r_order, '2019-04-14 00:00:00' as ts  UNION ALL
        SELECT 'sd3e' AS key, 0.2 as value,  2 as r_order, '2019-04-14 01:00:00' as ts  UNION ALL
        SELECT 'sd4r' AS key, 0.1 as value,  1 as r_order, '2019-04-14 00:00:00' as ts  UNION ALL
        SELECT 'sd4r' AS key, 0.5 as value,  2 as r_order, '2019-04-14 01:00:00' as ts  
    )
    
    SELECT * 
    FROM (
        SELECT * ,ROW_NUMBER() OVER(PARTITION BY key order by ts DESC) rn 
        FROM `DATA` 
    )
    WHERE rn = 1
    

    This produces the expected result: only the latest row for each key is returned (value 0.2 for 'sd3e' and 0.5 for 'sd4r').

    Note: This approach doesn't incur extra charges. However, you always have to apply the window function when reading from the table.
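The ROW_NUMBER pattern can be tried locally without a BigQuery project. The sketch below runs the same query shape against an in-memory SQLite database through Python's standard `sqlite3` module. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions), and the table name `data` is a stand-in for the real table. SQLite is used here only as a convenient local substitute; it is not how the query would be executed on BigQuery.

```python
import sqlite3

# In-memory database standing in for the BigQuery table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (key TEXT, value REAL, r_order INTEGER, ts TEXT)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        ("sd3e", 0.3, 1, "2019-04-14 00:00:00"),
        ("sd3e", 0.2, 2, "2019-04-14 01:00:00"),
        ("sd4r", 0.1, 1, "2019-04-14 00:00:00"),
        ("sd4r", 0.5, 2, "2019-04-14 01:00:00"),
    ],
)

# Number the rows of each key from newest to oldest and keep only rn = 1.
query = """
SELECT key, value, r_order, ts
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) AS rn
    FROM data
)
WHERE rn = 1
ORDER BY key
"""
latest_rows = conn.execute(query).fetchall()
conn.close()

for row in latest_rows:
    print(row)
```

Unlike a cleanup that deletes older rows, this leaves every uploaded row in the table and picks exactly one row per key at read time, so it can be wrapped in a view if the filter should not be repeated in every query.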
