Question
I have created a parsing library that accepts a provided input and returns a stream of Records. A program then calls this library and processes the results. In my case, my program is using something like
recordStream.forEach(r -> insertIntoDB(r));
One of the types of input that can be provided to the parsing library is a flat file, which may have a header row. As such, the parsing library can be configured to skip a header row. If a header row is configured, it adds a skip(n) operation to the returned pipeline, e.g.
Files.lines(input).skip(1).parallel().map(r -> createRecord(r));
The parsing library returns the resulting Stream.
But it seems that skip, parallel and forEach do not play nicely together. The end programmer must instead invoke forEachOrdered, but it is poor design to put this requirement on the programmer and expect them to know that they must use forEachOrdered when the input is a file with a header row.
How can I enforce the ordered requirement myself when necessary, within the construction of the returned stream chain, to return a fully functional stream to the program writer, instead of a stream with hidden limitations? Is the answer to wrap the stream in another stream?
Answer 1:
forEachOrdered is necessary not because of the skip(), but because your Stream is parallel. Even if the stream is parallel, the stream will skip the first element, as indicated in the documentation:
While skip() is generally a cheap operation on sequential stream pipelines, it can be quite expensive on ordered parallel pipelines, especially for large values of n, since skip(n) is constrained to skip not just any n elements, but the first n elements in the encounter order.
It's clearly documented that forEach doesn't necessarily respect the order. Not using forEachOrdered when you care about the order is just a misuse of the Stream API:
The behavior of this operation is explicitly nondeterministic. For parallel stream pipelines, this operation does not guarantee to respect the encounter order of the stream, as doing so would sacrifice the benefit of parallelism.
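The difference between the two terminal operations can be made visible with a small experiment. The following is only a sketch: the class and method names are made up for illustration, and it uses an IntStream in place of the file lines from the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

final class OrderedVersusUnordered {
    // forEachOrdered processes elements one at a time in encounter order,
    // so a plain ArrayList is sufficient here.
    static List<Integer> viaForEachOrdered(int n) {
        List<Integer> out = new ArrayList<>();
        IntStream.range(0, n).parallel().skip(1).boxed().forEachOrdered(out::add);
        return out;
    }

    // forEach may run the action concurrently and in any order,
    // so the target list has to be thread-safe.
    static List<Integer> viaForEach(int n) {
        List<Integer> out = Collections.synchronizedList(new ArrayList<>());
        IntStream.range(0, n).parallel().skip(1).boxed().forEach(out::add);
        return out;
    }

    public static void main(String[] args) {
        System.out.println("forEachOrdered: " + viaForEachOrdered(20));
        System.out.println("forEach:        " + viaForEach(20));
    }
}
```

The first list is always in ascending order. The second holds the elements in whatever order the worker threads happened to deliver them, which can differ from run to run.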
I would not return a parallel stream from the library. I would return a sequential one (where forEach would respect the order), and let the caller call parallel() and assume the consequences if it wants to.
Using a parallel stream by default is a bad idea.
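A minimal sketch of that design, assuming a hypothetical parse method that takes the lines as a List (the real library reads from a file) and a stand-in createRecord that merely wraps the line in a String:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class SequentialByDefault {
    // Hypothetical library entry point: always returns a sequential stream.
    static Stream<String> parse(List<String> lines, boolean hasHeader) {
        Stream<String> s = lines.stream();   // sequential source
        if (hasHeader) {
            s = s.skip(1);                   // drop the header row
        }
        return s.map(SequentialByDefault::createRecord);
    }

    // Stand-in for the real record construction.
    static String createRecord(String line) {
        return "Record(" + line + ")";
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("id,name", "1,a", "2,b");
        // Default use: the stream is sequential.
        parse(input, true).forEach(System.out::println);
        // The caller opts in to parallelism and picks an order-preserving terminal operation.
        System.out.println(parse(input, true).parallel().collect(Collectors.toList()));
    }
}
```

The library never calls parallel() itself, so the decision (and the choice between forEach and forEachOrdered) stays with the caller.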
Answer 2:
Considering the relevant scenario where
- the stream source is set up using skip
- the client code is requesting parallel() execution
- the client code is chaining an unordered terminal action like forEach
- the code runs on a JRE older than 1.8u60
we have quite a special combination of circumstances, all of which are outside the control of the particular library function that chains the .map(r -> createRecord(r)) operation.
I don’t think the responsibility lies at this point. In general, application code is not responsible for working around things that are already recognized as JRE bugs and fixed in up-to-date versions.
If for whatever reason you consider it necessary to provide a work-around for older JREs, it would be up to the stream source requiring the skip operation to do this.
For this specific case, it’s not so hard. You may create the BufferedReader directly, invoke readLine() to skip the first line and then return the result of lines(), which allows processing all remaining lines. That might be even more efficient than a parallel Stream bearing a skip operation.
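The BufferedReader approach could look roughly like the sketch below. The class and method names are hypothetical, and the onClose handler is an addition so that closing the returned stream also closes the reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

final class HeaderSkippingLines {
    // Hypothetical helper: reads the header eagerly, then streams the remaining lines.
    static Stream<String> linesWithoutHeader(Path input) throws IOException {
        BufferedReader reader = Files.newBufferedReader(input);
        try {
            reader.readLine();               // discard the header row (null if the file is empty)
        } catch (IOException e) {
            reader.close();                  // do not leak the reader on failure
            throw e;
        }
        // Close the reader when the returned stream is closed.
        return reader.lines().onClose(() -> {
            try {
                reader.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".csv");
        Files.write(file, Arrays.asList("id,name", "1,a", "2,b"));
        try (Stream<String> lines = linesWithoutHeader(file)) {
            // No skip() in the pipeline, so an unordered terminal operation is harmless.
            lines.parallel().forEach(System.out::println);
        }
        Files.delete(file);
    }
}
```

Because the header is consumed before the stream exists, the pipeline contains no skip at all and the caller is free to combine parallel() with forEach.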
A more general solution would be an “eager skip first” operation like this:
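import java.util.Spliterator;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
public static <T> Stream<T> skipFirstImmediately(Stream<T> source) {
    Spliterator<T> sp = source.spliterator();
    // consume and discard the first element right away
    sp.tryAdvance(skipped -> {});
    // continue with the remaining elements, keeping the parallel flag
    return StreamSupport.stream(sp, source.isParallel());
}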
Note that when using this method, due to properties of the current Stream implementation, it can be beneficial to turn the source Stream to parallel before invoking this method rather than turning the resulting Stream to parallel, if parallel execution is desired.
This can be verified by comparing the output of
skipFirstImmediately(IntStream.range(0, 10).parallel().boxed())
.peek(x -> System.out.println(Thread.currentThread()))
.forEach(System.out::println);
and
skipFirstImmediately(IntStream.range(0, 10).boxed()).parallel()
.peek(x -> System.out.println(Thread.currentThread()))
.forEach(System.out::println);
Both variants produce correct results, but the latter does not exploit the SMP capabilities.
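One detail to keep in mind when applying skipFirstImmediately to a resource-backed source such as Files.lines: spliterator() is a terminal operation, and the stream created by StreamSupport.stream does not inherit the source's close handlers. The variant below is a sketch of one way to forward close(). The onClose call is an addition and not part of the method shown above, and the class name is made up.

```java
import java.util.Spliterator;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class EagerSkip {
    // Same eager skip as before, but closing the result also closes the source.
    static <T> Stream<T> skipFirstImmediately(Stream<T> source) {
        Spliterator<T> sp = source.spliterator();
        sp.tryAdvance(skipped -> {});        // drop the first element now
        return StreamSupport.stream(sp, source.isParallel())
                            .onClose(source::close);
    }

    public static void main(String[] args) {
        AtomicBoolean closed = new AtomicBoolean();
        Stream<String> source = Stream.of("header", "a", "b")
                                      .parallel()
                                      .onClose(() -> closed.set(true));
        try (Stream<String> rest = skipFirstImmediately(source)) {
            System.out.println(rest.collect(Collectors.toList()));
        }
        System.out.println("source closed: " + closed.get());
    }
}
```

With this in place, a caller using try-with-resources on the returned stream still releases the underlying file handle.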
Source: https://stackoverflow.com/questions/38104601/how-to-design-a-returned-stream-that-may-use-skip