Google Dataflow “elementCountExact” aggregation

Submitted by 这一生的挚爱 on 2019-12-11 08:14:57

Question


I'm trying to aggregate a PCollection<String> into PCollection<List<String>> with ~60 elements each.

They will be sent to an API that accepts 60 elements per request. Currently I'm trying to do this with windowing, but there is only elementCountAtLeast, so I have to collect the elements into a list, count them again, and split the list if it is too long. This is quite cumbersome and results in a lot of lists with just a few elements:

Repeatedly.forever(AfterFirst.of(
                    AfterPane.elementCountAtLeast(maxNrOfelementsPerList),
                    AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1)))))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
            .apply("CollectIntoLists", Combine.globally(new StringToListCombinator()).withoutDefaults())
            .apply("SplitListsToMaxSize", ParDo.of(new DoFn<List<String>, List<String>>() {
                @ProcessElement
                public void apply(ProcessContext pc) {
                    splitList(pc.element(), maxNrOfelementsPerList).forEach(pc::output);
                }
            }));
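
The splitList helper called above is not shown in the question. A minimal sketch of what such a helper might look like, assuming it simply cuts a list into consecutive chunks of at most the given size (the class name ListSplitter and the exact signature are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the question calls splitList(...) but does not show it.
class ListSplitter {
    /** Splits source into consecutive chunks of at most maxSize elements. */
    static <T> List<List<T>> splitList(List<T> source, int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < source.size(); start += maxSize) {
            int end = Math.min(start + maxSize, source.size());
            // Copy the sublist so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(source.subList(start, end)));
        }
        return chunks;
    }
}
```

Only the last chunk can be shorter than maxSize, but because the trigger fires many panes, every pane can still leave one short list behind.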

Is there any direct and more consistent way to do this aggregation?


Answer 1:


This can be built using the State API in Dataflow SDK 2.x, which is based on Apache Beam.

Basically, you would write a stateful DoFn that has two pieces of state: a count of the number of elements and a "bag" of the elements that have been buffered.

When an element arrives, you add it to the bag and increment the count. You then check the count, and if it has reached 60 you output the buffered elements as a single list and clear both pieces of state.
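
The count-and-bag logic described above can be modeled in plain Java, independent of the SDK. This is only a sketch of the buffering behavior, not the actual DoFn: in a real stateful DoFn the two fields would be per-key state cells (for example a BagState and a ValueState declared with @StateId) rather than instance fields. The names ExactCountBuffer, processElement, and buffered are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java model of the buffering logic; all names here are hypothetical.
class ExactCountBuffer {
    private final int batchSize;
    private final List<String> bag = new ArrayList<>(); // stands in for the "bag" state
    private int count = 0;                              // stands in for the count state

    ExactCountBuffer(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    /** Buffers one element and emits a full batch once exactly batchSize are held. */
    void processElement(String element, Consumer<List<String>> output) {
        bag.add(element);
        count++;
        if (count == batchSize) {
            // Emit a copy, then clear both pieces of state.
            output.accept(new ArrayList<>(bag));
            bag.clear();
            count = 0;
        }
    }

    /** Number of elements currently waiting for the batch to fill. */
    int buffered() {
        return count;
    }
}
```

Note that in this model a partial batch stays buffered until enough elements arrive. A real pipeline would presumably also need some way to flush leftovers, such as a timer, which the answer does not cover.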

Since each key of a Stateful DoFn will run on a single machine, it would probably be good to randomly distribute your elements across N keys, so that you can scale up to N machines (multiple keys may run on one machine).
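
Random key assignment could look roughly like the sketch below. It uses a plain Map.Entry as a stand-in for the key/value pair; in the pipeline itself this would be a KV<Integer, String> fed into the stateful DoFn. The class name RandomSharder, the method withRandomKey, and the parameter numKeys (the N above) are assumptions for illustration.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: pairs each element with a random key in [0, numKeys).
class RandomSharder {
    static Map.Entry<Integer, String> withRandomKey(String element, int numKeys) {
        if (numKeys <= 0) {
            throw new IllegalArgumentException("numKeys must be positive");
        }
        // ThreadLocalRandom avoids contention when called from many threads.
        int key = ThreadLocalRandom.current().nextInt(numKeys);
        return new SimpleImmutableEntry<>(key, element);
    }
}
```

Because each key buffers independently, up to 59 elements per key can be waiting at any time, so a larger N buys parallelism at the cost of more partially filled buffers.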



Source: https://stackoverflow.com/questions/44278697/google-dataflow-elementcountexact-aggregation
