How to specify insertId when streaming inserts to BigQuery using Apache Beam

Submitted by 天涯浪子 on 2019-12-01 08:56:46
Felipe Hoffa
  • Pub/Sub + Beam/Dataflow + BigQuery: "Exactly once" should be guaranteed, and you don't need to worry much about this. For now, that guarantee is stronger when you ask Dataflow to write to BigQuery using FILE_LOADS instead of STREAMING_INSERTS.

  • Kafka + Beam/Dataflow + BigQuery: If a message can be emitted more than once from Kafka (e.g. if the producer retried the insertion), then you need to take care of de-duplication yourself. You can do that either in BigQuery (as currently implemented, according to your comment) or in Dataflow with a .apply(Distinct.create()) transform.
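
To make the de-duplication idea in the Kafka case concrete, here is a minimal sketch in plain Java. It is not Beam code and not how Dataflow implements anything; it only illustrates the "keep the first message per ID, drop the retries" behaviour that a de-duplication step is meant to provide. The `Message` class, its `id` and `payload` fields, and the `dedupeById` helper are all hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Conceptual sketch only: plain Java, not Beam or Dataflow code.
final class DedupSketch {

    // Hypothetical message shape: an ID assigned by the producer plus a payload.
    static final class Message {
        final String id;
        final String payload;

        Message(String id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    // Keeps the first message seen for each ID and drops later repeats,
    // preserving the original order of the messages that survive.
    static List<Message> dedupeById(List<Message> messages) {
        Set<String> seen = new HashSet<>();
        List<Message> unique = new ArrayList<>();
        for (Message m : messages) {
            // Set.add returns false when the ID was already present.
            if (seen.add(m.id)) {
                unique.add(m);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Message> input = new ArrayList<>();
        input.add(new Message("a", "first"));
        input.add(new Message("b", "second"));
        input.add(new Message("a", "first")); // simulated producer retry
        for (Message m : dedupeById(input)) {
            System.out.println(m.id + " " + m.payload);
        }
    }
}
```

In a real pipeline this bookkeeping would be left to the runner. Distinct.create() compares whole elements, so if only an ID field identifies a duplicate, Beam's Java SDK also offers Distinct.withRepresentativeValueFn to compare on that field instead. On an unbounded source such as Kafka, de-duplication applies within each window, so the windowing strategy decides how far apart two copies can arrive and still be caught.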

As Felipe mentioned in the comment, it seems that Dataflow already uses insertId itself to implement "exactly once", so we cannot specify insertId manually.
