Apache Beam -> BigQuery - insertId for deduplication not working

Submitted by 故事扮演 on 2019-12-02 12:11:34

Question


I am streaming data from Kafka to BigQuery using Apache Beam with the Google Dataflow runner. I wanted to make use of insertId for deduplication, as described in the Google docs. But even though the inserts happen within a few seconds of each other, I still see a lot of rows with the same insertId. Now I'm wondering whether I am using the API correctly to take advantage of the deduplication mechanism for streaming inserts offered by BQ.

My Beam code for the write looks as follows:

payments.apply("Write Fx Payments to BQ", BigQueryIO.<FxPayment>write()
            // convert each FxPayment element into a BigQuery TableRow
            .withFormatFunction(ps -> FxTableRowConverter.convertFxPaymentToTableRow(ps))
            .to(bqTradePaymentTable)
            // the table must already exist; rows are appended via streaming inserts
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

Besides all the other fields, I am setting insertId directly on the TableRow in the FxTableRowConverter.convertFxPaymentToTableRow method, which is passed to BigQueryIO as the format function:

row.set("insertId", insertId);

I also added that field as a column in BQ; without it, the inserts were failing (obviously). I couldn't find any other way to set insertId on BigQueryIO than adding it to the TableRow object.
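
For illustration, a minimal sketch of what the converter does, assuming hypothetical FxPayment accessors (only the TableRow and the "insertId" column come from the question). Note that insertId ends up as an ordinary data column in the row payload:

import com.google.api.services.bigquery.model.TableRow;

public class FxTableRowConverter {
    // Builds the TableRow handed to BigQueryIO. "insertId" here is just
    // another column of the row payload -- it is not the insertId field of
    // the streaming insertAll request that BigQuery uses for deduplication.
    public static TableRow convertFxPaymentToTableRow(FxPayment payment) {
        TableRow row = new TableRow();
        row.set("paymentId", payment.getId());        // hypothetical field
        row.set("amount", payment.getAmount());       // hypothetical field
        row.set("insertId", payment.getDedupKey());   // plain data column
        return row;
    }
}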

Is this the correct way of using it? It does not work for me: I am seeing many duplicates even though I shouldn't, since, as mentioned, the inserts happen within seconds of each other. The BigQuery docs state that the streaming buffer keeps insertId for at least one minute.


Answer 1:


You can't manually specify insertId for BigQuery streaming inserts in Dataflow; the BigQueryIO sink generates its own insertId for each row, so a TableRow field named insertId is treated as ordinary data: https://stackoverflow.com/a/54193825/1580227
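
For contrast, a minimal sketch using the google-cloud-bigquery Java client directly, outside Beam (the dataset, table, row fields, and ID values are hypothetical), showing where insertId actually lives in the streaming API: it is an attribute of each row in the insertAll request, not a column of the table.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

import java.util.Map;

public class StreamingInsertExample {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("my_dataset", "fx_payments"); // hypothetical

        // The row content holds only real columns; no "insertId" column needed.
        Map<String, Object> rowContent = Map.of(
                "paymentId", "P-123",
                "amount", 99.5);

        InsertAllResponse response = bigquery.insertAll(
                InsertAllRequest.newBuilder(table)
                        // insertId is passed per row on the request itself
                        .addRow(InsertAllRequest.RowToInsert.of("payment-P-123", rowContent))
                        .build());

        if (response.hasErrors()) {
            System.err.println("Insert errors: " + response.getInsertErrors());
        }
    }
}

In a Beam/Dataflow pipeline, BigQueryIO issues this insertAll call itself and fills in its own generated IDs for retry deduplication, which is why setting a TableRow field named insertId has no deduplication effect.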



Source: https://stackoverflow.com/questions/46401195/apache-beam-bigquery-insertid-for-deduplication-not-working
