Question
According to the release notes for Dataflow 2.x, IntraBundleParallelization has been removed. Is there a way to control or increase the parallelism of DoFns on Dataflow 2.1.0?
I was getting better performance when I used IntraBundleParallelization on Dataflow 1.9.0.
Answer 1:
It was removed because its implementation keeps a handle on the ProcessContext of a ProcessElement call after the call completes, and this is unsafe and not guaranteed to work.
However, I agree that it was a useful abstraction, and it is unfortunate that we don't have a replacement yet.
As a workaround, you can try the following:
- In your DoFn's @Setup, create an Executor with the needed number of threads
- In your DoFn's @StartBundle, create an ExecutorCompletionService wrapping the executor
- In @ProcessElement, submit a Future to it representing the result of processing the element
- In @ProcessElement, also poll() the CompletionService for completed futures and output their results
- In @FinishBundle, wait for all remaining futures to complete, output their results, and shut down the CompletionService.
Remember to not use the ProcessContext in your futures. ProcessContext can only be used from the current thread and from within the current ProcessElement call.
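The steps above can be sketched roughly as follows. This is a minimal, Beam-free model of the pattern and not a drop-in DoFn: the class name ParallelBundleSketch, the work function, and the Consumer used for output are all hypothetical stand-ins, and each method is only labeled with the DoFn annotation it corresponds to. Since CompletionService itself has no shutdown method, the sketch assumes that "shutting it down" means discarding it at the end of the bundle and shutting down the underlying executor in a teardown step.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical, Beam-free model of the workaround. In a real DoFn each method
// would carry the annotation named in its comment.
final class ParallelBundleSketch<I, O> {
    private final int threads;
    private final Function<I, O> work;       // hypothetical per-element computation
    private ExecutorService executor;
    private CompletionService<O> completion;
    private int pending;                      // futures submitted but not yet output

    ParallelBundleSketch(int threads, Function<I, O> work) {
        this.threads = threads;
        this.work = work;
    }

    // Corresponds to @Setup: one thread pool per DoFn instance.
    void setup() {
        executor = Executors.newFixedThreadPool(threads);
    }

    // Corresponds to @StartBundle: a fresh completion service per bundle.
    void startBundle() {
        completion = new ExecutorCompletionService<>(executor);
        pending = 0;
    }

    // Corresponds to @ProcessElement. The task captures only the element value,
    // never the context; all output happens on the calling thread.
    void processElement(I element, Consumer<O> output) {
        completion.submit(() -> work.apply(element));
        pending++;
        Future<O> done;
        while ((done = completion.poll()) != null) {  // non-blocking drain
            emit(done, output);
        }
    }

    // Corresponds to @FinishBundle: block until every remaining future is done.
    void finishBundle(Consumer<O> output) {
        try {
            while (pending > 0) {
                emit(completion.take(), output);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        completion = null;
    }

    // Corresponds to @Teardown (an assumption; the steps above do not mention it).
    void teardown() {
        executor.shutdown();
    }

    private void emit(Future<O> done, Consumer<O> output) {
        pending--;
        try {
            output.accept(done.get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("element processing failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        ParallelBundleSketch<Integer, Integer> fn =
            new ParallelBundleSketch<>(4, x -> x * x);
        List<Integer> results = new ArrayList<>();
        fn.setup();
        fn.startBundle();
        for (int i = 1; i <= 5; i++) {
            fn.processElement(i, results::add);
        }
        fn.finishBundle(results::add);
        fn.teardown();
        Collections.sort(results);            // completion order is not deterministic
        System.out.println(results);          // prints [1, 4, 9, 16, 25]
    }
}
```

Results are emitted in completion order rather than input order, which is why the demo sorts before printing. Because the worker threads only compute values and the calling thread does all the output, nothing in the submitted tasks needs to touch the ProcessContext.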
Source: https://stackoverflow.com/questions/47023871/is-there-an-alternative-to-intrabundleparallelization-in-dataflow-2-1-0