Question:
I want to run a pipeline on Google Dataflow that depends on the output of another pipeline. Right now I am just running the two pipelines one after the other locally with the DirectRunner:
with beam.Pipeline(options=pipeline_options) as p:
    (p
     | beam.io.ReadFromText(known_args.input)
     | SomeTransform()
     | beam.io.WriteToText('temp'))

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | beam.io.ReadFromText('temp*')
     | AnotherTransform()
     | beam.io.WriteToText(known_args.output))
My questions are the following:
- Does the DataflowRunner guarantee that the second pipeline is only started after the first one has completed?
- Is there a preferred way to run two pipelines consecutively, one after the other?
- Also, is there a recommended way to separate those pipelines into different files, so that they can be tested more easily?
Answer 1:
Does the DataflowRunner guarantee that the second pipeline is only started after the first one has completed?
No, Dataflow simply executes a pipeline. It has no features for managing dependent pipeline executions.
Update: For clarification, Apache Beam does provide a mechanism for waiting for a pipeline to finish executing: see the waitUntilFinish() method of the PipelineResult class (wait_until_finish() in the Python SDK). Reference: PipelineResult.waitUntilFinish().
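As a minimal sketch of that approach in the Python SDK (reusing the placeholder names from the question, so SomeTransform, AnotherTransform, pipeline_options, and known_args are assumptions, not real library components), you can run the first pipeline without the with block and block on its result before launching the second:

import apache_beam as beam

# Run the first pipeline and block until it has finished.
first = beam.Pipeline(options=pipeline_options)
(first
 | beam.io.ReadFromText(known_args.input)
 | SomeTransform()
 | beam.io.WriteToText('temp'))
first.run().wait_until_finish()

# Only now build and run the second pipeline, which reads the
# files the first one wrote.
second = beam.Pipeline(options=pipeline_options)
(second
 | beam.io.ReadFromText('temp*')
 | AnotherTransform()
 | beam.io.WriteToText(known_args.output))
second.run().wait_until_finish()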
Is there a preferred way to run two pipelines consecutively, one after the other?
Consider using a tool like Apache Airflow to manage dependent pipelines. You could even use a simple bash script to deploy one pipeline after the other has finished.
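For example, a minimal Airflow DAG sketch (the DAG id and the pipeline script names are hypothetical) can chain the two pipeline launches so the second task only starts after the first succeeds:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="dependent_beam_pipelines",
         start_date=datetime(2024, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:
    first = BashOperator(task_id="first_pipeline",
                         bash_command="python first_pipeline.py")
    second = BashOperator(task_id="second_pipeline",
                          bash_command="python second_pipeline.py")
    # The >> dependency makes Airflow wait for first_pipeline
    # to succeed before starting second_pipeline.
    first >> second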
Also, is there a recommended way to separate those pipelines into different files, so that they can be tested more easily?
Yes, use separate files. That is simply good code organization, though not necessarily what makes the pipelines more testable.
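For instance, if SomeTransform lives in its own module (here a hypothetical transforms.py), it can be unit-tested in isolation with the Beam testing utilities; the inputs and expected outputs below are placeholders:

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to
from transforms import SomeTransform  # hypothetical module holding the transform

def test_some_transform():
    with TestPipeline() as p:
        output = (p
                  | beam.Create(['input-1', 'input-2'])  # placeholder inputs
                  | SomeTransform())
        # Assert on the transform's output; the expected values are placeholders.
        assert_that(output, equal_to(['output-1', 'output-2']))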
Source: https://stackoverflow.com/questions/49191548/executing-a-pipeline-only-after-another-one-finishes-on-google-dataflow