Flink: DataSet.count() is a bottleneck - how to count in parallel?

Submitted by 送分小仙女 on 2020-05-27 12:07:12

Question


I am learning MapReduce with Flink and have a question about how to efficiently count the elements of a DataSet. What I have so far is this:

DataSet<MyClass> ds = ...;
long num = ds.count();

When executing this, in my flink log it says

12/03/2016 19:47:27 DataSink (count())(1/1) switched to RUNNING

So only one CPU is used (I have four, and other operations like reduce use all of them).

I think count() internally collects the DataSet from all four CPUs and counts them sequentially instead of having each CPU count its part and then sum it up. Is that true?

If yes, how can I take advantage of all my CPUs? Would it be a good idea to first map my DataSet to a 2-tuple that contains the original value as first item and the long value 1 as second item and then aggregate it using the SUM function?

For example, the DataSet would be mapped to DataSet<Tuple2<MyClass, Long>> where the Long would always be 1. So when I sum up all items, the sum of the second tuple field would be the correct count value.

What is the best practice to count items in a DataSet?

Regards Simon


Answer 1:


DataSet#count() is a non-parallel operation and thus can only use a single thread.

To parallelize, you could do a count-by-key and then apply a final sum over the per-key counts to get the overall count; this speeds up the computation.
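A sketch of the count-by-key idea this answer describes, using Flink's DataSet API. The class name, the countByKey helper, and the String elements standing in for MyClass (each element acting as its own key) are illustrative assumptions, not part of the original answer:

```java
import java.util.List;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class CountByKey {

    // Map each element to (key, 1), sum the 1s per key in parallel,
    // then add up the (few) per-key counts on the client side.
    static long countByKey(DataSet<String> ds) throws Exception {
        List<Tuple2<String, Long>> perKey = ds
                .map(v -> Tuple2.of(v, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .groupBy(0)   // group by the key field
                .sum(1)       // parallel per-key counts
                .collect();

        long total = 0L;
        for (Tuple2<String, Long> t : perKey) {
            total += t.f1;
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> ds = env.fromElements("a", "b", "a", "c", "b", "a");
        System.out.println(countByKey(ds)); // 6 elements in this toy input
    }
}
```

The per-key sums run with full parallelism; only the final addition of the small per-key result list is serial.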




Answer 2:


Is this a good solution?

DataSet<Tuple1<Long>> x = ds.map(new MapFunction<MyClass, Tuple1<Long>>() {
    @Override
    public Tuple1<Long> map(MyClass t) throws Exception {
        return new Tuple1<Long>(1L);
    }
}).groupBy(0).sum(0);

Long c = x.collect().iterator().next().f0;
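One caveat with this approach: since every tuple carries the constant 1, groupBy(0) puts all records into a single group, so the final sum runs in one reducer (though Flink's combiners still pre-aggregate per partition). A sketch of an alternative using the DataSet API's mapPartition, where each subtask counts its own partition and a final reduce sums the handful of partial counts (the class and method names here are illustrative):

```java
import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class PartitionCount {

    static long countParallel(DataSet<Long> data) throws Exception {
        // Each parallel instance emits one partial count for its partition ...
        DataSet<Long> partials = data
                .mapPartition((MapPartitionFunction<Long, Long>) (values, out) -> {
                    long cnt = 0L;
                    for (Long ignored : values) {
                        cnt++;
                    }
                    out.collect(cnt);
                })
                .returns(Types.LONG);

        // ... and a final reduce adds the few partials together.
        return partials.reduce(Long::sum).collect().get(0);
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        System.out.println(countParallel(env.generateSequence(1, 1000))); // 1000
    }
}
```

Here the expensive per-element work is done by every subtask on its local partition; only one partial count per subtask crosses the network.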


Source: https://stackoverflow.com/questions/40951458/flink-dataset-count-is-bottleneck-how-to-count-parallel
