Question
On my machine, the program below prints:
OptionalLong[134043]
PARALLEL took 127869 ms
OptionalLong[134043]
SERIAL took 60594 ms
It's not clear to me why executing the program in serial is faster than executing it in parallel. I've given both programs -Xms2g -Xmx2g on an 8 GB box that's relatively quiet. Can someone clarify what's going on?
import java.util.stream.LongStream;
import java.util.stream.LongStream.Builder;

public class Problem47 {

    public static void main(String[] args) {
        final long startTime = System.currentTimeMillis();
        System.out.println(LongStream.iterate(1, n -> n + 1)
                .parallel()
                .limit(1000000)
                .filter(n -> fourConsecutives(n))
                .findFirst());
        final long endTime = System.currentTimeMillis();
        System.out.println(" PARALLEL took " + (endTime - startTime) + " ms");

        final long startTime2 = System.currentTimeMillis();
        System.out.println(LongStream.iterate(1, n -> n + 1)
                .limit(1000000)
                .filter(n -> fourConsecutives(n))
                .findFirst());
        final long endTime2 = System.currentTimeMillis();
        System.out.println(" SERIAL took " + (endTime2 - startTime2) + " ms");
    }

    static boolean fourConsecutives(final long n) {
        return distinctPrimeFactors(n).count() == 4 &&
               distinctPrimeFactors(n + 1).count() == 4 &&
               distinctPrimeFactors(n + 2).count() == 4 &&
               distinctPrimeFactors(n + 3).count() == 4;
    }

    static LongStream distinctPrimeFactors(long number) {
        final Builder builder = LongStream.builder();
        final long limit = number / 2;
        long n = number;
        for (long i = 2; i <= limit; i++) {
            while (n % i == 0) {
                builder.accept(i);
                n /= i;
            }
        }
        return builder.build().distinct();
    }
}
Answer 1:
While Brian Goetz is right about your setup, e.g. that you should use range(1, 1000000) rather than iterate(1, n -> n + 1).limit(1000000), and that your benchmark method is very simplistic, I want to emphasize the important point: even after fixing these issues, and even using just a wall clock and the TaskManager, you can see that there's something wrong. On my machine the operation takes about half a minute, and you can see that parallelism drops to a single core after about two seconds. Even if a specialized benchmark tool could produce different results, it wouldn't matter unless you want to run your final application within a benchmark tool all the time…
Now we could quibble further about your setup, or tell you that you should learn special things about the Fork/Join framework, like the implementors did on the discussion list.
Or we try an alternative implementation:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.LongStream;

ExecutorService es = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
AtomicLong found = new AtomicLong(Long.MAX_VALUE);
LongStream.range(1, 1000000)
    .filter(n -> found.get() == Long.MAX_VALUE)
    .forEach(n -> es.submit(() -> {
        // record n as the result unless a smaller candidate already won the race
        if (found.get() > n && fourConsecutives(n)) for (;;) {
            long x = found.get();
            if (x < n || found.compareAndSet(x, n)) break;
        }
    }));
es.shutdown();
try {
    es.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
} catch (InterruptedException ex) {
    throw new AssertionError(ex);
}
long result = found.get();
System.out.println(result == Long.MAX_VALUE ? "not found" : result);
On my machine it does what I would expect from parallel execution, taking only slightly more than ⟨sequential time⟩/⟨number of CPU cores⟩, without changing anything in your fourConsecutives implementation.
The bottom line is that, at least when processing a single item takes significant time, the current Stream implementation (or the underlying Fork/Join framework) has problems, as already discussed in this related question. If you want reliable parallelism, I would recommend using proven and tested ExecutorServices. As you can see in my example, that does not mean dropping the Java 8 features; they fit together well. Only the automatic parallelism introduced with Stream.parallel should be used with care (given the current implementation).
Answer 2:
We can make it easier to execute in parallel, but we can't necessarily make parallelism easy.
The culprit in your code is the combination of limit+parallel. Implementing limit() is trivial for sequential streams, but fairly expensive for parallel streams. This is because the definition of the limit operation is tied to the encounter order of the stream. Streams with limit() are often slower in parallel than in sequential, unless the computation done per element is very high.
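As a small illustrative sketch of that ordering cost (the use of unordered() here is my addition, not part of the original answer): limit() on an ordered parallel stream must deliver exactly the first elements in encounter order, while an unordered stream lets limit() keep any elements, which is much cheaper to coordinate across threads.

```java
import java.util.Arrays;
import java.util.stream.LongStream;

public class LimitOrderDemo {
    public static void main(String[] args) {
        // On an ordered parallel stream, limit(5) must return the first five
        // elements in encounter order, which forces the threads to coordinate.
        long[] ordered = LongStream.range(0, 1_000_000).parallel().limit(5).toArray();
        System.out.println(Arrays.toString(ordered)); // [0, 1, 2, 3, 4]

        // Dropping the ordering constraint lets limit(5) keep *any* five
        // elements; only the count is guaranteed, not which elements survive.
        long count = LongStream.range(0, 1_000_000).unordered().parallel().limit(5).count();
        System.out.println(count); // 5
    }
}
```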
Your choice of stream source is also limiting parallelism. Using iterate(0, n -> n + 1) gives you the positive integers, but iterate is fundamentally sequential; you can't compute the nth element until you've computed the (n-1)th element. So when we try to split this stream, we end up splitting (first, rest). Try using range(0, k) instead; it splits much more nicely, neatly by halves, with random access.
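To make the contrast concrete, here is a minimal sketch with a cheap stand-in predicate (n % 97 == 0, my substitution for the expensive fourConsecutives). The range version splits by halves and parallelizes well; the iterate version must generate elements one after another, so its splits degenerate to (first chunk, rest). Both still find the same first match.

```java
import java.util.OptionalLong;
import java.util.stream.LongStream;

public class RangeVsIterate {
    public static void main(String[] args) {
        // range() knows its size and splits neatly by halves, so the
        // parallel filter can be distributed across cores effectively.
        OptionalLong viaRange = LongStream.range(1, 1_000_000)
                .parallel()
                .filter(n -> n % 97 == 0)   // stand-in for fourConsecutives
                .findFirst();
        System.out.println(viaRange.getAsLong()); // 97

        // iterate() is inherently sequential: element n depends on element
        // n-1, so splitting yields lopsided (first chunk, rest) pieces.
        OptionalLong viaIterate = LongStream.iterate(1, n -> n + 1)
                .limit(1_000_000)
                .filter(n -> n % 97 == 0)
                .findFirst();
        System.out.println(viaIterate.getAsLong()); // 97
    }
}
```

findFirst() respects encounter order even on a parallel stream, which is why both pipelines agree on the result.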
Source: https://stackoverflow.com/questions/24027247/java-8-streams-serial-vs-parallel-performance