Question:
I have to join data from Google Datastore and Google BigTable to produce a report, and I need to run that operation every minute. Is it possible to accomplish this with Google Cloud Dataflow (assuming the processing itself does not take long and/or can be split into independent parallel jobs)?
Should I have an endless loop inside "main" that creates and executes the same pipeline again and again?
If most of the time in such a scenario is spent bringing up the VMs, is it possible to instruct Dataflow to use customer-provided VMs instead?
Thanks,
Answer 1:
If you expect that your job is small enough to complete in 60 seconds, you could consider using the Datastore and BigTable APIs from within a DoFn in a streaming job. Your pipeline might look something like:
// Emit one element per minute, indefinitely.
PCollection<Long> impulse = p.apply(
    CountingInput.unbounded().withRate(1, Duration.standardMinutes(1)));
// Each per-minute element triggers one read from each source.
PCollection<A> input1 = impulse.apply(ParDo.of(readFromDatastore));
PCollection<B> input2 = impulse.apply(ParDo.of(readFromBigTable));
...
This produces a single input element every minute, forever. Because the job runs as a streaming pipeline, the VMs stay up between iterations rather than being brought up fresh for each run.
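As a rough sketch of what one of those read transforms could look like (assuming the Dataflow Java SDK 1.x DoFn style and the google-cloud-datastore client library; the ReadFromDatastoreFn name and the "Order" kind are placeholders, not part of the original answer):

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;

// Hypothetical DoFn: each per-minute impulse triggers one Datastore query.
class ReadFromDatastoreFn extends DoFn<Long, Entity> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
    // "Order" is a placeholder kind; query whatever entities your report needs.
    Query<Entity> query = Query.newEntityQueryBuilder().setKind("Order").build();
    QueryResults<Entity> results = datastore.run(query);
    while (results.hasNext()) {
      c.output(results.next());
    }
  }
}

A real pipeline would also need a suitable Coder registered for the output type; that detail is omitted here.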
After reading from both APIs, you can window and join the results as necessary.
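For the join itself, one common approach is to key both sides, window them identically, and use CoGroupByKey. A minimal sketch, reusing the placeholder element types A and B from above and hypothetical keying DoFns keyDatastoreEntity and keyBigtableRow:

// Key both sides by a common join key (the keying DoFns are placeholders).
PCollection<KV<String, A>> keyedA = input1.apply(ParDo.of(keyDatastoreEntity));
PCollection<KV<String, B>> keyedB = input2.apply(ParDo.of(keyBigtableRow));

// Window both sides the same way so each per-minute read is joined with its peer.
PCollection<KV<String, A>> windowedA = keyedA.apply(
    Window.<KV<String, A>>into(FixedWindows.of(Duration.standardMinutes(1))));
PCollection<KV<String, B>> windowedB = keyedB.apply(
    Window.<KV<String, B>>into(FixedWindows.of(Duration.standardMinutes(1))));

// CoGroupByKey performs the join within each one-minute window.
final TupleTag<A> aTag = new TupleTag<A>();
final TupleTag<B> bTag = new TupleTag<B>();
PCollection<KV<String, CoGbkResult>> joined = KeyedPCollectionTuple
    .of(aTag, windowedA)
    .and(bTag, windowedB)
    .apply(CoGroupByKey.<String>create());

Each resulting CoGbkResult holds the Datastore and BigTable elements that share the same key within the same window, ready to be formatted into the report.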
Source: https://stackoverflow.com/questions/38019761/running-periodic-dataflow-job