Batch PCollection in Beam/Dataflow

天涯浪子 submitted on 2019-12-02 10:12:37

Question


I have a PCollection in GCP Dataflow/Apache Beam. Instead of processing its elements one by one, I need to combine them "by N", something like grouped(N). So, for bounded input, it would group the elements into batches of 10, with the last batch holding whatever is left. Is this possible in Apache Beam?
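To pin down the semantics being asked for, here is a minimal plain-Java sketch of "grouped(N)" over an in-memory list. It is only an illustration of the expected output shape, not a Beam transform; the class name BatchSketch and the method grouped are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the desired "grouped(N)" semantics; not a Beam transform.
// BatchSketch and grouped are hypothetical names used only for this example.
class BatchSketch {
    // Splits items into consecutive batches of at most n elements;
    // the last batch holds whatever is left over.
    static <T> List<List<T>> grouped(List<T> items, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += n) {
            int end = Math.min(start + n, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 1; i <= 25; i++) {
            items.add(i);
        }
        // 25 items with n = 10 give batches of sizes 10, 10, and 5.
        for (List<Integer> batch : grouped(items, 10)) {
            System.out.println(batch.size() + " " + batch);
        }
    }
}
```

In a distributed PCollection there is no global element order, so which elements end up together in a batch would generally not be deterministic the way it is for this list.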


Answer 1:


Edit: this looks like the question Google Dataflow "elementCountExact" aggregation.

You should be able to do something similar by assigning elements to the global window and using AfterPane.elementCountAtLeast(N). You still need to account for what happens if there aren't enough elements to fire the trigger. You could use this:

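 // Fire repeatedly, each time whichever condition below is met first:
 Repeatedly.forever(AfterFirst.of(
     // at least N elements have arrived in the pane, or
     AfterPane.elementCountAtLeast(N),
     // X minutes of processing time have passed since the pane's first element.
     AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(X))))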

But you should ask yourself why you need this heuristic in the first place; there is probably a more idiomatic way to solve your problem. Read about Data-Driven Triggers in Beam's programming guide.
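The firing rule of the composite trigger in this answer (count reached, or a delay elapsed since the first element, whichever comes first, then start over) can be sketched in plain Java as follows. This is a hypothetical simulation for intuition only: it does not use Beam's classes, CountOrDelayBuffer and its methods are invented names, and time is passed in explicitly. It also assumes fired elements are discarded after each firing, which in Beam would depend on the accumulation mode you choose. Note that elementCountAtLeast means "at least", so a real pipeline may emit panes with more than N elements, whereas this sketch fires at exactly N.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical simulation of "fire at N elements or after a delay, repeatedly".
// It does not use or reproduce Beam's trigger implementation.
class CountOrDelayBuffer<T> {
    private final int minCount;
    private final long maxDelayMillis;
    private final List<T> pane = new ArrayList<>();
    private long firstElementMillis;

    CountOrDelayBuffer(int minCount, long maxDelayMillis) {
        this.minCount = minCount;
        this.maxDelayMillis = maxDelayMillis;
    }

    // Adds an element at time nowMillis; returns the fired pane if the count was reached.
    Optional<List<T>> add(T element, long nowMillis) {
        if (pane.isEmpty()) {
            firstElementMillis = nowMillis; // the delay is measured from the first element
        }
        pane.add(element);
        return pane.size() >= minCount ? Optional.of(fire()) : Optional.empty();
    }

    // Called as the clock advances; fires a partial pane once the delay has elapsed.
    Optional<List<T>> tick(long nowMillis) {
        boolean expired = !pane.isEmpty()
                && nowMillis - firstElementMillis >= maxDelayMillis;
        return expired ? Optional.of(fire()) : Optional.empty();
    }

    // Emits the buffered elements and resets, mirroring the "repeat forever" behaviour.
    private List<T> fire() {
        List<T> out = new ArrayList<>(pane);
        pane.clear();
        return out;
    }
}
```

The delay branch is what handles the "not enough elements" case: a leftover partial batch is still emitted once the timeout passes.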



Source: https://stackoverflow.com/questions/44348085/batch-pcollection-in-beam-dataflow
