Flink: DataSet.count() is a bottleneck - how to count in parallel?

Submitted by 送分小仙女 on 2020-05-27 12:07:12

Question


I am learning MapReduce with Flink and have a question about how to efficiently count the elements of a DataSet. What I have so far is this:

DataSet<MyClass> ds = ...;
long num = ds.count();

When executing this, in my flink log it says

12/03/2016 19:47:27 DataSink (count())(1/1) switched to RUNNING

So only one CPU is used (I have four, and other operations like reduce use all of them).

I think count() internally collects the DataSet from all four CPUs and counts them sequentially instead of having each CPU count its part and then sum it up. Is that true?

If yes, how can I take advantage of all my CPUs? Would it be a good idea to first map my DataSet to a 2-tuple that contains the original value as first item and the long value 1 as second item and then aggregate it using the SUM function?

For example, the DataSet would be mapped to DataSet<Tuple2<MyClass, Long>> where the Long would always be 1. So when I sum up all items, the sum of the second tuple field would be the correct count value.

What is the best practice to count items in a DataSet?

Regards Simon


Answer 1:


DataSet#count() is a non-parallel operation and thus can only use a single thread.

To parallelize, you could do a count-by-key and then apply a final sum over the per-key counts to get the overall count; this speeds up the computation.
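A sketch of the count-by-key idea this answer describes, using Flink's DataSet API. The class name, the countByKey helper, and the String elements standing in for MyClass (each element acting as its own key) are illustrative assumptions, not part of the original answer:

```java
import java.util.List;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class CountByKey {

    // Map each element to (key, 1), sum the 1s per key in parallel,
    // then add up the (few) per-key counts on the client side.
    static long countByKey(DataSet<String> ds) throws Exception {
        List<Tuple2<String, Long>> perKey = ds
                .map(v -> Tuple2.of(v, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .groupBy(0)   // group by the key field
                .sum(1)       // parallel per-key counts
                .collect();

        long total = 0L;
        for (Tuple2<String, Long> t : perKey) {
            total += t.f1;
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> ds = env.fromElements("a", "b", "a", "c", "b", "a");
        System.out.println(countByKey(ds)); // 6 elements in this toy input
    }
}
```

The per-key sums run with full parallelism; only the final addition of the small per-key result list is serial.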




Answer 2:


Is this a good solution?

DataSet<Tuple1<Long>> x = ds.map(new MapFunction<MyClass, Tuple1<Long>>() {
    @Override
    public Tuple1<Long> map(MyClass t) throws Exception {
        return new Tuple1<Long>(1L);
    }
}).groupBy(0).sum(0);

Long c = x.collect().iterator().next().f0;
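One caveat with this approach: since every tuple carries the constant 1, groupBy(0) puts all records into a single group, so the final sum runs in one reducer (though Flink's combiners still pre-aggregate per partition). A sketch of an alternative using the DataSet API's mapPartition, where each subtask counts its own partition and a final reduce sums the handful of partial counts (the class and method names here are illustrative):

```java
import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class PartitionCount {

    static long countParallel(DataSet<Long> data) throws Exception {
        // Each parallel instance emits one partial count for its partition ...
        DataSet<Long> partials = data
                .mapPartition((MapPartitionFunction<Long, Long>) (values, out) -> {
                    long cnt = 0L;
                    for (Long ignored : values) {
                        cnt++;
                    }
                    out.collect(cnt);
                })
                .returns(Types.LONG);

        // ... and a final reduce adds the few partials together.
        return partials.reduce(Long::sum).collect().get(0);
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        System.out.println(countParallel(env.generateSequence(1, 1000))); // 1000
    }
}
```

Here the expensive per-element work is done by every subtask on its local partition; only one partial count per subtask crosses the network.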


Source: https://stackoverflow.com/questions/40951458/flink-dataset-count-is-bottleneck-how-to-count-parallel
