I\'m trying to understand warnings I found in the Documentation on Streams. I\'ve gotten in the habit of using forEach() as a general purpose iterator. And that\'s lead me t
Side effects frequently makes assumptions about state and context. In parallel you are not guaranteed a specific order you see the elements in and multiple threads may run at the same time.
Unless you code for this this can give very subtle bugs which is very hard to track and fix when trying to go parallel.
I believe the documentation is mentioning about the side effects demonstrated by the below code:
List<Integer> matched = new ArrayList<>();
List<Integer> elements = new ArrayList<>();
for(int i=0 ; i< 10000 ; i++) {
elements.add(i);
}
elements.parallelStream()
.forEach(e -> {
if(e >= 100) {
matched.add(e);
}
});
System.out.println(matched.size());
This code streams through the list in parallel, and tries to add elements into other list if they match the certain criteria. As the resultant list is not synchronised, you will get java.lang.ArrayIndexOutOfBoundsException
while executing the above code.
The fix would be to create a new list and return, e.g.:
List<Integer> elements = new ArrayList<>();
for(int i=0 ; i< 10000 ; i++) {
elements.add(i);
}
List<Integer> matched = elements.parallelStream()
.filter(e -> e >= 100)
.collect(Collectors.toList());
System.out.println(matched.size());
From the Javadoc:
Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; if you do not synchronize access to that state, you have a data race and therefore your code is broken, but if you do synchronize access to that state, you risk having contention undermine the parallelism you are seeking to benefit from. The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.
The problem here is that if you access a mutable state, you loose on two side:
Stream
tries to minimizeConcurrentHashMap
, this has a cost).Now, in your example, there are several points here:
Stream
and multi threading stream, you need to use parralelStream()
as in myThings.parralelStream()
; as it stands, the forEach
method provided by java.util.Collection
is simple for each
.HashMap
as a static
member and you mutate it. HashMap
is not threadsafe; you need to use a ConcurrentHashMap
.In the lambda, and in the case of a Stream
, you must not mutate the source of your stream:
myThings.stream().forEach(thing -> myThings.remove(thing));
This may work (but I suspect it will throw a ConcurrentModificationException
) but this will likely not work:
myThings.parallelStream().forEach(thing -> myThings.remove(thing));
That's because the ArrayList
is not thread safe.
If you use a synchronized view (Collections.synchronizedList
), then you would have a performance it because you synchronize on each access.
In your example, you would rather use:
sortOrderCache = myThings.stream()
.collect(Collectors.groupingBy(
Thing::getId, Thing::getSortOrder);
codeNameCache= myThings.stream()
.collect(Collectors.groupingBy(
Thing::getId, Thing::getCodeName);
The finisher (here the groupingBy
) does the work you were doing and might be called sequentially (I mean, the Stream may be split across several thread, the the finisher may be invoked several times (in different thread) and then it might need to merge.
By the way, you might eventually drop the codeNameCache
/sortOrderCache
and simply store the id->Thing mapping.