Can someone please explain to me why the sequential version of the π approximation was faster than the parallel one?
I can't figure it out.
I'm playing around with parallel streams.
I get even worse results running in parallel on my machine (3.0 GHz Intel Core i7, two cores, four threads):
sequential: PI ~ 3.14175124 calculated in 4952 msecs
parallel: PI ~ 3.14167776 calculated in 21320 msecs
I suspect the main reason is that Math.random() is thread-safe, and so it synchronizes around every call. Since there are multiple threads all trying to get random numbers at the same time, they're all contending for the same lock. This adds a tremendous amount of overhead. Note that the specification for Math.random() says the following:
This method is properly synchronized to allow correct use by more than one thread. However, if many threads need to generate pseudorandom numbers at a great rate, it may reduce contention for each thread to have its own pseudorandom-number generator.
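For reference, the contended parallel version presumably looks something like this. It's only a sketch, since the question's code isn't reproduced here, and NUM_SAMPLES stands in for whatever sample count is actually used:

long count = LongStream.rangeClosed(1, NUM_SAMPLES)   // NUM_SAMPLES is assumed, matching the original code
                       .parallel()
                       .filter(e -> {
                           // every thread goes through the single shared generator behind
                           // Math.random(), so parallel calls contend with each other
                           double x = Math.random();
                           double y = Math.random();
                           return x * x + y * y < 1;
                       })
                       .count();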
To avoid lock contention, use ThreadLocalRandom instead:
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.LongStream;

long count = LongStream.rangeClosed(1, NUM_SAMPLES)
                       .parallel()
                       .filter(e -> {
                           // each thread uses its own generator, so there is no shared lock to contend on
                           ThreadLocalRandom cur = ThreadLocalRandom.current();
                           double x = cur.nextDouble();
                           double y = cur.nextDouble();
                           return x * x + y * y < 1;
                       })
                       .count();
This gives the following results:
sequential2: PI ~ 3.14169156 calculated in 1171 msecs
parallel2: PI ~ 3.14166796 calculated in 648 msecs
which is a 1.8x speedup, not too bad for a two-core machine. Note that the ThreadLocalRandom version is also faster when run sequentially, probably because there's no locking overhead at all.
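The timing and the final π calculation aren't shown above, so in case it helps, here's a minimal, self-contained sketch of how such a run can be measured. The class name, sample count, and output format are my own choices, not necessarily the exact harness that produced the numbers above:

import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.LongStream;

public class PiEstimate {
    // assumed sample count; the runs above use 100_000_000
    static final long NUM_SAMPLES = 100_000_000L;

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        long count = LongStream.rangeClosed(1, NUM_SAMPLES)
                               .parallel()
                               .filter(e -> {
                                   ThreadLocalRandom cur = ThreadLocalRandom.current();
                                   double x = cur.nextDouble();
                                   double y = cur.nextDouble();
                                   return x * x + y * y < 1;
                               })
                               .count();
        long elapsed = System.currentTimeMillis() - start;

        // the fraction of points inside the unit quarter circle approaches pi/4,
        // so multiplying by 4 gives the estimate
        double pi = 4.0 * count / NUM_SAMPLES;
        System.out.printf("parallel2: PI ~ %s calculated in %d msecs%n", pi, elapsed);
    }
}

Dropping the .parallel() call gives the sequential2 variant.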
Aside: Normally for benchmarks I'd suggest using JMH. However, this benchmark seems to run long enough that it gives a reasonable indication of relative speeds. For more precise results, though, I do recommend using JMH.
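If you do want JMH-level precision, a minimal benchmark might look roughly like this. The class and method names are mine, and you'd run it through the usual JMH Maven archetype or Gradle plugin setup:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.stream.LongStream;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class PiBenchmark {
    // hypothetical parameter; 100_000_000 matches the runs above
    @Param("100000000")
    long numSamples;

    @Benchmark
    public long parallelThreadLocalRandom() {
        return LongStream.rangeClosed(1, numSamples)
                         .parallel()
                         .filter(e -> {
                             ThreadLocalRandom cur = ThreadLocalRandom.current();
                             double x = cur.nextDouble();
                             double y = cur.nextDouble();
                             return x * x + y * y < 1;
                         })
                         .count();
    }
}

JMH takes care of warmup, forking, and measurement iterations, which is why it's preferable for benchmarks that don't run long enough to smooth out JIT and GC noise on their own.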
UPDATE
Here are additional results (requested by user3666197 in comments), using a NUM_SAMPLES value of 1_000_000_000 compared to the original 100_000_000. I've copied the results from above for easy comparison.
NUM_SAMPLES = 100_000_000
sequential: PI ~ 3.14175124 calculated in 4952 msecs
parallel: PI ~ 3.14167776 calculated in 21320 msecs
sequential2: PI ~ 3.14169156 calculated in 1171 msecs
parallel2: PI ~ 3.14166796 calculated in 648 msecs
NUM_SAMPLES = 1_000_000_000
sequential: PI ~ 3.141572896 calculated in 47730 msecs
parallel: PI ~ 3.141543836 calculated in 228969 msecs
sequential2: PI ~ 3.1414865 calculated in 12843 msecs
parallel2: PI ~ 3.141635704 calculated in 7953 msecs
The sequential and parallel runs use (mostly) the same code as in the question, while sequential2 and parallel2 use my modified ThreadLocalRandom code. The new timings are overall roughly 10x longer, as one would expect. The longer parallel2 run isn't quite as fast as one might hope, though it's not totally out of line, showing about a 1.6x speedup on a two-core machine.