Right way to handle one-to-many stages in Dataflow

Question

I have a (Java) batch pipeline that follows this pattern:

(FileIO)
(ExtractText > input=1 file, output=millions of lines of text)
(ProcessData)
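
For concreteness, the three stages above might look roughly like the following in the Beam Java SDK. This is only a sketch of the shape of the pipeline, not the actual code: the class names, the `gs://my-bucket/input/*.txt` pattern, and the whitelist lookup in `ProcessDataFn` are hypothetical placeholders.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class OneToManyPipeline {

  /** One-to-many stage: one input file in, one output element per line of text. */
  static class ExtractTextFn extends DoFn<FileIO.ReadableFile, String> {
    @ProcessElement
    public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<String> out)
        throws IOException {
      try (BufferedReader reader = new BufferedReader(
          Channels.newReader(file.open(), StandardCharsets.UTF_8.name()))) {
        String line;
        while ((line = reader.readLine()) != null) {
          out.output(line);
        }
      }
    }
  }

  /** Hypothetical stand-in for the slow matching logic. */
  static boolean isWhitelisted(String line, Set<String> whitelist) {
    return whitelist.contains(line.trim());
  }

  /** Slow per-element stage; keeps only lines found in the whitelist. */
  static class ProcessDataFn extends DoFn<String, String> {
    // HashSet is Serializable, which a DoFn field needs to be.
    private final HashSet<String> whitelist;

    ProcessDataFn(HashSet<String> whitelist) {
      this.whitelist = whitelist;
    }

    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> out) {
      if (isWhitelisted(line, whitelist)) {
        out.output(line);
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Placeholder whitelist; the real ones are described as big.
    HashSet<String> whitelist = new HashSet<>(Arrays.asList("alpha", "beta"));

    p.apply("MatchFiles", FileIO.match().filepattern("gs://my-bucket/input/*.txt"))
        .apply("ReadMatches", FileIO.readMatches())
        .apply("ExtractText", ParDo.of(new ExtractTextFn()))
        .apply("ProcessData", ParDo.of(new ProcessDataFn(whitelist)));

    p.run().waitUntilFinish();
  }
}
```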

The ProcessData stage contains slow parts (matching data against big whitelists) and needs to scale across several workers, which should not be an issue since it only contains DoFns. However, it seems that my one-to-many stage forces all of its outputs to be processed by a single worker: adding more workers leaves all but one of them idle, or they are scaled down if autoscaling is enabled.

Based on other Stack Overflow posts, I have tried shuffling via Reshuffle.viaRandomKey(). This does not work because Reshuffle contains a GroupByKey, which loads all the results into memory and causes an OOM, even if I window the data beforehand via Window.<String>into(FixedWindows.of(Duration.standardSeconds(1))).
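
The attempt described above, windowing and then reshuffling between ExtractText and ProcessData, could be written as a small composite transform along these lines. It is a sketch under assumptions: the class and step names are hypothetical, and only the `Window` and `Reshuffle.viaRandomKey()` calls come from the description.

```java
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ReshuffleAttempt {

  // Window size used in the attempt: one second.
  static final Duration WINDOW_SIZE = Duration.standardSeconds(1);

  /** Hypothetical composite: fixed one-second windows, then a random-key reshuffle. */
  static class WindowThenReshuffle
      extends PTransform<PCollection<String>, PCollection<String>> {
    @Override
    public PCollection<String> expand(PCollection<String> lines) {
      return lines
          .apply("WindowOneSecond", Window.<String>into(FixedWindows.of(WINDOW_SIZE)))
          .apply("RedistributeLines", Reshuffle.<String>viaRandomKey());
    }
  }
}
```

It would be applied between the two stages, for example `extracted.apply(new ReshuffleAttempt.WindowThenReshuffle())`, where `extracted` is the output of ExtractText.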

Another option would be to create a CustomSource to replace the first two stages, but I find this method inadequate because 1) the documentation of custom sources is severely lacking, 2) it takes more time and code to implement, and 3) this one-to-many issue could well be encountered in the middle of a pipeline, where I couldn't create custom sources.

How should I handle one-to-many stages in a Dataflow pipeline?

Source: https://stackoverflow.com/questions/51397077/right-way-to-handle-one-to-many-stages-in-dataflow

Tags

java

performance

one-to-many

google-cloud-dataflow

apache-beam
