How to ingest data from a GCS bucket via Dataflow as soon as a new file is put into it?

走远了吗 · Submitted 2021-01-28 07:37:16

Question


I have a use case where I need to ingest data from a Google Cloud Storage bucket via Dataflow as soon as it becomes available, in the form of a new file placed in the bucket.

How do I trigger the Dataflow job as soon as new data (a file) is added to the storage bucket?


Answer 1:


If your pipelines are written in Java, then you can use Cloud Functions and Dataflow templating.

I'm going to assume you're using the 1.x SDK (this is also possible with 2.x).

  1. Write your pipeline and specify "TemplatingDataflowPipelineRunner" as the runner, which stages the pipeline as a reusable template in GCS.
  2. Write a Cloud Function that listens for and reacts to new objects (in this case CSV files) arriving in your bucket.
  3. Have the Cloud Function kick off the Dataflow pipeline from the template, passing the name of the new file to it as a parameter.
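Steps 2 and 3 can be sketched as a Python Cloud Function that reacts to a GCS object-finalize event and launches the staged Dataflow template through the public `templates:launch` REST endpoint. This is a minimal sketch, not a definitive implementation: `PROJECT`, `TEMPLATE`, and the `inputFile` parameter name are assumptions about your setup, and the template from step 1 is assumed to already be staged at the `TEMPLATE` path.

```python
import json
import urllib.request

PROJECT = "my-project"                          # assumption: your GCP project ID
TEMPLATE = "gs://my-bucket/templates/ingest"    # assumption: staged template path


def build_launch_request(file_name, bucket):
    """Build the JSON body for the Dataflow templates.launch REST call."""
    return {
        # Dataflow job names must be lowercase letters, digits, and hyphens.
        "jobName": "ingest-" + file_name.replace("/", "-").replace(".", "-"),
        "parameters": {
            # Pass the new object's full GCS path to the templated pipeline.
            # "inputFile" is an assumption: it must match a runtime parameter
            # your template actually declares.
            "inputFile": f"gs://{bucket}/{file_name}",
        },
    }


def fetch_access_token():
    """Fetch an OAuth access token from the metadata server (available
    inside Cloud Functions / GCE, not on your laptop)."""
    req = urllib.request.Request(
        "http://metadata.google.internal/computeMetadata/v1/"
        "instance/service-accounts/default/token",
        headers={"Metadata-Flavor": "Google"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]


def gcs_trigger(event, context):
    """Entry point for a google.storage.object.finalize trigger.

    `event` carries the new object's "bucket" and "name" fields.
    """
    body = build_launch_request(event["name"], event["bucket"])
    url = (f"https://dataflow.googleapis.com/v1b3/projects/{PROJECT}"
           f"/templates:launch?gcsPath={TEMPLATE}")
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + fetch_access_token()},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # the created Dataflow job description
```

Deploying this with a `--trigger-resource <bucket> --trigger-event google.storage.object.finalize` trigger means every finished upload invokes `gcs_trigger`, so each new file starts its own templated Dataflow job with that file as input.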

See here for a walkthrough on how to build this pipeline. Full disclosure: I work for Shine.



Source: https://stackoverflow.com/questions/43786052/how-to-ingest-data-from-a-gcs-bucket-via-dataflow-as-soon-as-a-new-file-is-put-i
