Question:
I have to join data from Google Datastore and Google BigTable to produce a report, and I need to run that operation every minute. Is it possible to accomplish this with Google Cloud Dataflow (assuming the processing itself does not take long and/or can be split into independent parallel jobs)?
Should I have an endless loop inside "main" that creates and executes the same pipeline again and again?
If most of the time in such a scenario is spent bringing up the VMs, is it possible to instruct Dataflow to use customer-provided VMs instead?
Thanks,
Answer 1:
If you expect that your job is small enough to complete in 60 seconds, you could consider using the Datastore and BigTable APIs from within a DoFn in a streaming job. Your pipeline might look something like:
// Emit one element per minute, indefinitely.
PCollection<Long> impulse = p.apply(
    CountingInput.unbounded().withRate(1, Duration.standardMinutes(1)));
// Each per-minute element triggers one read from each source.
PCollection<A> input1 = impulse.apply(ParDo.of(readFromDatastore));
PCollection<B> input2 = impulse.apply(ParDo.of(readFromBigTable));
...
This produces a single input element every minute, forever. Because the job runs as a streaming pipeline, the VMs stay up between iterations rather than being brought up fresh for each run.
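As a rough sketch of what one of those read transforms could look like (assuming the Dataflow Java SDK 1.x DoFn style and the google-cloud-datastore client library; the ReadFromDatastoreFn name and the "Order" kind are placeholders, not part of the original answer):

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;

// Hypothetical DoFn: each per-minute impulse triggers one Datastore query.
class ReadFromDatastoreFn extends DoFn<Long, Entity> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
    // "Order" is a placeholder kind; query whatever entities your report needs.
    Query<Entity> query = Query.newEntityQueryBuilder().setKind("Order").build();
    QueryResults<Entity> results = datastore.run(query);
    while (results.hasNext()) {
      c.output(results.next());
    }
  }
}

A real pipeline would also need a suitable Coder registered for the output type; that detail is omitted here.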
After reading from both APIs, you can window and join the results as necessary.
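For the join itself, one common approach is to key both sides, window them identically, and use CoGroupByKey. A minimal sketch, reusing the placeholder element types A and B from above and hypothetical keying DoFns keyDatastoreEntity and keyBigtableRow:

// Key both sides by a common join key (the keying DoFns are placeholders).
PCollection<KV<String, A>> keyedA = input1.apply(ParDo.of(keyDatastoreEntity));
PCollection<KV<String, B>> keyedB = input2.apply(ParDo.of(keyBigtableRow));

// Window both sides the same way so each per-minute read is joined with its peer.
PCollection<KV<String, A>> windowedA = keyedA.apply(
    Window.<KV<String, A>>into(FixedWindows.of(Duration.standardMinutes(1))));
PCollection<KV<String, B>> windowedB = keyedB.apply(
    Window.<KV<String, B>>into(FixedWindows.of(Duration.standardMinutes(1))));

// CoGroupByKey performs the join within each one-minute window.
final TupleTag<A> aTag = new TupleTag<A>();
final TupleTag<B> bTag = new TupleTag<B>();
PCollection<KV<String, CoGbkResult>> joined = KeyedPCollectionTuple
    .of(aTag, windowedA)
    .and(bTag, windowedB)
    .apply(CoGroupByKey.<String>create());

Each resulting CoGbkResult holds the Datastore and BigTable elements that share the same key within the same window, ready to be formatted into the report.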
Source: https://stackoverflow.com/questions/38019761/running-periodic-dataflow-job