How to specify insertId when streaming insert to BigQuery using Apache Beam

长发绾君心 2021-01-14 03:00

BigQuery supports de-duplication for streaming insert. How can I use this feature using Apache Beam?

https://cloud.google.com/bigquery/streaming-data-into-bigquery#d

2 Answers
  • 2021-01-14 03:25

    As Felipe mentioned in the comment, it seems that Dataflow already uses insertId internally to implement "exactly once", so we cannot manually specify the insertId.
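
    For reference, the insertId knob does exist on the raw BigQuery streaming API, just not through Beam's BigQueryIO. Below is a minimal sketch using the google-cloud-bigquery Java client; the dataset, table, row content, and insertId value are all placeholders, not anything from this thread:

        import com.google.cloud.bigquery.BigQuery;
        import com.google.cloud.bigquery.BigQueryOptions;
        import com.google.cloud.bigquery.InsertAllRequest;
        import com.google.cloud.bigquery.InsertAllResponse;
        import com.google.cloud.bigquery.TableId;

        import java.util.Map;

        public class InsertIdExample {
          public static void main(String[] args) {
            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

            // Placeholder dataset and table names.
            TableId tableId = TableId.of("my_dataset", "my_table");

            // The first argument to addRow is the insertId; BigQuery uses it
            // for best-effort de-duplication of retried inserts.
            InsertAllResponse response = bigquery.insertAll(
                InsertAllRequest.newBuilder(tableId)
                    .addRow("event-12345", Map.of("user", "alice", "count", 1))
                    .build());

            if (response.hasErrors()) {
              System.out.println("Insert errors: " + response.getInsertErrors());
            }
          }
        }

    When Beam writes via STREAMING_INSERTS, Dataflow generates these IDs itself, which is why there is no user-facing hook for them.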

  • 2021-01-14 03:46
    • Pub/Sub + Beam/Dataflow + BigQuery: "exactly once" should be guaranteed, and you don't need to worry much about this. That guarantee is stronger when you ask Dataflow to insert into BigQuery using FILE_LOADS instead of STREAMING_INSERTS, for now (see the sketch after this list).

    • Kafka + Beam/Dataflow + BigQuery: if a message can be emitted more than once from Kafka (e.g. if the producer retried the insertion), then you need to take care of de-duplication yourself, either in BigQuery (as currently implemented, according to your comment) or in Dataflow with a .apply(Distinct.create()) transform, as in the sketch below.
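
    Putting both points together, here is a minimal sketch of that de-duplication in Beam Java. The source (GenerateSequence standing in for KafkaIO), table spec, window size, and triggering frequency are illustrative assumptions, not part of the original answer:

        import com.google.api.services.bigquery.model.TableRow;
        import org.apache.beam.sdk.Pipeline;
        import org.apache.beam.sdk.io.GenerateSequence;
        import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
        import org.apache.beam.sdk.options.PipelineOptionsFactory;
        import org.apache.beam.sdk.transforms.Distinct;
        import org.apache.beam.sdk.transforms.MapElements;
        import org.apache.beam.sdk.transforms.windowing.FixedWindows;
        import org.apache.beam.sdk.transforms.windowing.Window;
        import org.apache.beam.sdk.values.TypeDescriptor;
        import org.apache.beam.sdk.values.TypeDescriptors;
        import org.joda.time.Duration;

        public class DedupToBigQuery {
          public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            p
                // Stand-in unbounded source; a real pipeline would use KafkaIO.read().
                .apply(GenerateSequence.from(0).withRate(10, Duration.standardSeconds(1)))
                // n % 3 deliberately yields repeated values, mimicking duplicate messages.
                .apply(MapElements.into(TypeDescriptors.strings())
                    .via(n -> "user-" + (n % 3)))
                // Distinct needs bounded windows on an unbounded stream.
                .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
                // Drop duplicate elements within each one-minute window.
                .apply(Distinct.create())
                .apply(MapElements.into(TypeDescriptor.of(TableRow.class))
                    .via(user -> new TableRow().set("user", user)))
                .apply(BigQueryIO.writeTableRows()
                    .to("my-project:my_dataset.my_table") // placeholder table spec
                    // FILE_LOADS bypasses streaming inserts entirely; on unbounded
                    // input it needs a triggering frequency and file sharding to
                    // cut periodic batch load jobs.
                    .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                    .withTriggeringFrequency(Duration.standardMinutes(5))
                    .withNumFileShards(1)
                    // Assumes the destination table already exists.
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

            p.run();
          }
        }

    Whether you de-duplicate in Dataflow like this or keep doing it in BigQuery is mostly a cost/latency trade-off: Distinct adds a shuffle per window, while de-duplicating downstream keeps the pipeline simpler.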
