Ignore duplicate records while appending in BigQuery

Submitted by 大憨熊 on 2020-01-06 14:27:29

Question


We are writing data from MySQL to BigQuery. We have set up some indicators:

  • Insert - If the record is being added for the first time, save it with 'I' in the Indicator field.
  • Update - If the record has some updated data, save it with 'U' in the Indicator field, and ignore duplicate records that have not changed.

But in the 'Update' case it also writes duplicate records that have not changed at all. The following is the query we currently use to insert the data into the BigQuery table. What changes can we make to this query?

"insert into `actual_table` 

(
    Id,
   ...
)
select
temp.Id,
...
case when actual.Id is null then 'I'
when actual.Id is not null and actual.field1<>temp.field1 then 'U'
end as Indicator,
FROM `temp_table` temp 
left outer join `actual_table` actual
on temp.Id= actual.Id"

The actual table is the target table in BigQuery, whereas the temp table is the staging table, also in BigQuery. Every time we read data from MySQL, we store it in the temp table.

Thanks


Answer 1:


I doubt that your query inserts unchanged rows (same Id and same field1) with a 'U' indicator. What it does is insert them with an empty Indicator: the CASE has no ELSE branch, so a row that matches neither condition gets NULL in the Indicator field and is still written. Add an explicit ELSE to the CASE and wrap the query in an outer SELECT that keeps only the rows with a 'U' or 'I' Indicator. If the Indicator field is not necessary, use the MERGE command...

"insert into `actual_table` 

(
    Id,
   ...
)
select * from
(
select
temp.Id,
...
case when actual.Id is null then 'I'
when actual.Id is not null and actual.field1<>temp.field1 then 'U'
else null 
end as Indicator,
FROM `temp_table` temp 
left outer join `actual_table` actual
on temp.Id= actual.Id
)
where Indicator is not null
"
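
To see why the unchanged rows slip through, here is a small self-contained sketch that runs the same CASE logic over three inline rows instead of the real tables. The column names `actual_id`, `actual_field1` and `temp_field1` are made up for the illustration; they stand in for the joined columns of `actual_table` and `temp_table`.

```sql
-- Hypothetical inline data standing in for the result of the left join:
--   id 1: no match in the actual table  (new record)
--   id 2: match, field1 differs          (changed record)
--   id 3: match, field1 identical        (unchanged record)
SELECT
  id,
  CASE
    WHEN actual_id IS NULL THEN 'I'
    WHEN actual_field1 <> temp_field1 THEN 'U'
    -- no ELSE branch: anything else evaluates to NULL
  END AS Indicator
FROM UNNEST([
  STRUCT(1 AS id,
         CAST(NULL AS INT64) AS actual_id,
         CAST(NULL AS STRING) AS actual_field1,
         'a' AS temp_field1),
  STRUCT(2, 2, 'a', 'b'),
  STRUCT(3, 3, 'a', 'a')
])
ORDER BY id;
```

Row 3 comes back with a NULL Indicator rather than being dropped, which is why the outer `where Indicator is not null` filter is needed before inserting.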



Answer 2:


Another option I like with BigQuery is doing the inserts with MERGE DML. It's quite a neat solution if it suits your use case. You can see more details in this link.

SQL example:

MERGE
    `mytable` as tgt
USING
    `mytable` as src
ON FALSE
WHEN NOT MATCHED AND src._PARTITIONTIME = '2019-02-21'
THEN INSERT (_PARTITIONTIME, fields...) VALUES (_PARTITIONTIME, fields...)
WHEN NOT MATCHED BY SOURCE AND tgt._PARTITIONTIME = '2019-02-21'
THEN DELETE
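
Applied to the tables from the question, a MERGE could look roughly like the sketch below. It rests on assumptions that are not stated in the question: that changed records may be updated in place rather than appended as new rows, that `temp_table` holds at most one row per Id, and that `field1` is the only column to compare (the other columns are elided in the question and are left out here).

```sql
-- Sketch only: updates changed rows in place instead of appending them.
MERGE `actual_table` AS actual
USING `temp_table` AS temp
ON actual.Id = temp.Id
-- Existing record whose data changed: overwrite and mark as updated.
WHEN MATCHED AND actual.field1 <> temp.field1 THEN
  UPDATE SET field1 = temp.field1, Indicator = 'U'
-- Record not yet present in the actual table: insert and mark as new.
WHEN NOT MATCHED THEN
  INSERT (Id, field1, Indicator)
  VALUES (temp.Id, temp.field1, 'I')
```

Matched rows whose `field1` is unchanged satisfy neither clause, so they are left untouched and no duplicate is written. If the history of changes has to be kept as separate rows, this in-place variant does not fit and the filtered INSERT from the first answer is the closer match.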


Source: https://stackoverflow.com/questions/55222951/ignore-duplicate-records-while-appending-in-bigquery
