Quickly degrading stream throughput with chained operations?

Posted by ε祈祈猫儿з on 2019-12-23 09:25:57

Question


I expected that simple intermediate stream operations, such as limit(), would have very little overhead. But the difference in throughput between these examples is actually significant:

final long MAX = 5_000_000_000L;

LongStream.rangeClosed(0, MAX)
          .count();
// throughput: 1.7 bn values/second


LongStream.rangeClosed(0, MAX)
          .limit(MAX)
          .count();
// throughput: 780m values/second

LongStream.rangeClosed(0, MAX)
          .limit(MAX)
          .limit(MAX)
          .count();
// throughput: 130m values/second

LongStream.rangeClosed(0, MAX)
          .limit(MAX)
          .limit(MAX)
          .limit(MAX)
          .count();
// throughput: 65m values/second

I am curious: what is the reason for the quickly degrading throughput? Is this a consistent pattern with chained stream operations, or a problem with my test setup? (I have not used JMH so far, just a quick experiment with a stopwatch.)
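For reference, a minimal stopwatch harness along the lines described (the class and method names and the smaller MAX are my own; a proper benchmark would use JMH to avoid JIT warm-up and dead-code artifacts):

```java
import java.util.stream.LongStream;

public class StreamThroughput {
    // Rough throughput in values/second for a single, unwarmed run of the task.
    static double throughput(Runnable task, long values) {
        long start = System.nanoTime();
        task.run();
        long elapsedNanos = System.nanoTime() - start;
        return values / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        final long MAX = 50_000_000L; // smaller than 5bn to keep the demo quick
        double plain = throughput(() -> LongStream.rangeClosed(0, MAX).count(), MAX);
        double oneLimit = throughput(() -> LongStream.rangeClosed(0, MAX).limit(MAX).count(), MAX);
        double twoLimits = throughput(
                () -> LongStream.rangeClosed(0, MAX).limit(MAX).limit(MAX).count(), MAX);
        System.out.printf("plain: %.0f/s, one limit: %.0f/s, two limits: %.0f/s%n",
                plain, oneLimit, twoLimits);
    }
}
```

Note that on Java 9+ the plain count() can short-circuit entirely (see Answer 2), which makes its "throughput" look absurdly high in such a harness.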


Answer 1:


limit results in a slice being made of the stream, backed by a slice spliterator (to support splitting for parallel operation). In a word: inefficient. That is a large overhead for what is effectively a no-op here, and the fact that two consecutive limit calls produce two separate slices makes it worse.

You should take a look at the implementation of IntStream.limit.

As Streams are still relatively new, optimization tends to come last, once production code exists. And calling limit three times in a row seems a bit far-fetched anyway.
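A quick way to observe the slicing is to compare the spliterator classes before and after a limit() call (the exact class names are JDK-internal and vary by version, so the sketch only checks that they differ):

```java
import java.util.Spliterator;
import java.util.stream.LongStream;

public class SliceDemo {
    public static void main(String[] args) {
        Spliterator.OfLong plain = LongStream.rangeClosed(0, 100).spliterator();
        Spliterator.OfLong sliced = LongStream.rangeClosed(0, 100).limit(100).spliterator();
        // The plain range has its own specialized spliterator; adding limit()
        // wraps the pipeline in a slice, visible as a different internal class.
        System.out.println(plain.getClass().getName());
        System.out.println(sliced.getClass().getName());
    }
}
```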




Answer 2:


This is an under-optimized corner of the Stream API (I don't know what else to call it).

In the first example, the count is known without actually counting: there are no operations (filter, for example) that would clear the internal flag called SIZED, so the stream can report its size directly. It gets a bit interesting if you change the bounds and inspect the flag:

System.out.println(
            LongStream.rangeClosed(0, Long.MAX_VALUE)
                    .spliterator()
                    .hasCharacteristics(Spliterator.SIZED)); // reports false

System.out.println(
            LongStream.rangeClosed(0, Long.MAX_VALUE - 1) // -1 here
                    .spliterator()
                    .hasCharacteristics(Spliterator.SIZED)); // reports true

The first stream reports false because rangeClosed(0, Long.MAX_VALUE) contains Long.MAX_VALUE + 1 elements, a count that overflows long and therefore cannot be reported as a known size; shrink the range by one and the stream is SIZED again.

And limit, even though there are (AFAIK) no fundamental obstacles, does not preserve the SIZED flag:

System.out.println(LongStream.rangeClosed(0, MAX)
            .limit(MAX)
            .spliterator()
            .hasCharacteristics(Spliterator.SIZED)); // reports false

Since you call count in every example, this matters: when the Stream API does not know internally that the stream is SIZED, it has to traverse and count every element; when the stream is SIZED, reporting the count is, well, instant.

When you chain limit a few times, you just make it worse, since each slice has to limit the previous one, element by element, every single time.

Things have improved in java-9; for example, consider this case:

System.out.println(LongStream.rangeClosed(0, MAX)
            .map(x -> {
                System.out.println(x);
                return x;
            })
            .count());

In this case map is not evaluated at all, since there is no need for it: no intermediate operation changes the size of the stream, so count can be computed without traversing the elements.
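One way to observe this is with a side-effect counter instead of println (the counter is my own addition; the skipping behaviour is version-dependent — Java 8 invokes the mapper for every element, Java 9+ does not):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.LongStream;

public class CountShortCircuit {
    public static void main(String[] args) {
        AtomicLong mapperCalls = new AtomicLong();
        long count = LongStream.rangeClosed(0, 1_000)
                .map(x -> { mapperCalls.incrementAndGet(); return x; })
                .count();
        // count is always 1001; on Java 9+ mapperCalls stays 0 because the
        // pipeline knows map() cannot change the number of elements.
        System.out.println("count = " + count + ", mapper calls = " + mapperCalls.get());
    }
}
```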

Theoretically the Stream API could see that you are limiting and 1) reintroduce the SIZED flag when the resulting size is still known, and 2) collapse multiple limit calls into a single one with the smallest value. At the moment this is not done, but the scope is very narrow: how many people would abuse limit this way? So don't expect improvements on this front soon.
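For what it's worth, that collapse can be done by hand today, since chained limits are semantically equivalent to a single limit with the smallest value (the numbers below are arbitrary):

```java
import java.util.stream.LongStream;

public class MergedLimit {
    public static void main(String[] args) {
        long a = 1_000, b = 500;
        long chained = LongStream.rangeClosed(0, 2_000).limit(a).limit(b).count();
        long merged = LongStream.rangeClosed(0, 2_000).limit(Math.min(a, b)).count();
        // Both pipelines yield 500 elements, but the merged version
        // pays for only one slice operation instead of two.
        System.out.println(chained + " == " + merged);
    }
}
```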



Source: https://stackoverflow.com/questions/52646345/quickly-degrading-stream-throughput-with-chained-operations
