What is the danger of side effects in Java 8 Streams?

余生分开走 2020-12-16 01:57

I'm trying to understand warnings I found in the documentation on Streams. I've gotten into the habit of using forEach() as a general-purpose iterator. And that's led me t

3 Answers
  • 2020-12-16 02:32

    Side effects frequently make assumptions about state and context. In a parallel stream you are not guaranteed any particular order in which the elements are processed, and multiple threads may run the lambda at the same time.

    Unless you code for this, going parallel can introduce very subtle bugs that are hard to track down and fix.
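
    The ordering point can be illustrated with a minimal sketch. The class and method names (OrderDemo, viaForEach, viaCollect) are made up for this example. The first method uses a synchronized list, so it is thread-safe, yet the order in which elements arrive is unspecified; the second lets a collector build the list and keeps the encounter order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class OrderDemo {
    // Side-effecting version: thread-safe thanks to the synchronized wrapper,
    // but elements are added in whatever order the worker threads reach them.
    static List<Integer> viaForEach(int n) {
        List<Integer> out = Collections.synchronizedList(new ArrayList<>());
        IntStream.range(0, n).parallel().boxed().forEach(e -> out.add(e));
        return out;
    }

    // Side-effect-free version: the collector assembles the result and
    // preserves the encounter order of the source, even in parallel.
    static List<Integer> viaCollect(int n) {
        return IntStream.range(0, n).parallel().boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> sideEffect = viaForEach(1000);
        List<Integer> collected = viaCollect(1000);
        System.out.println("same size: " + (sideEffect.size() == collected.size()));
        // May print true or false from run to run; that is the point.
        System.out.println("forEach kept encounter order: " + sideEffect.equals(collected));
    }
}
```

    Code that silently relies on the first list being sorted would pass on a sequential stream and fail intermittently on a parallel one.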

  • 2020-12-16 02:35

    I believe the documentation is referring to side effects like the one demonstrated by the code below:

    List<Integer> matched = new ArrayList<>();
    List<Integer> elements = new ArrayList<>();
    
    for (int i = 0; i < 10000; i++) {
        elements.add(i);
    }
    
    elements.parallelStream()
        .forEach(e -> {
            if(e >= 100) {
                matched.add(e);
            }
        });
    System.out.println(matched.size());
    

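    This code streams through the list in parallel and tries to add each element that matches the criterion to another list. Because the result list is a plain ArrayList with no synchronization, several threads mutate it at once: you may get a java.lang.ArrayIndexOutOfBoundsException, or the run may complete but silently lose elements and print a size smaller than the expected 9900.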

    The fix is to let the stream build and return a new list instead of mutating a shared one, e.g.:

    List<Integer> elements = new ArrayList<>();
    for (int i = 0; i < 10000; i++) {
        elements.add(i);
    }   
    List<Integer> matched = elements.parallelStream()
        .filter(e -> e >= 100)
        .collect(Collectors.toList());
    System.out.println(matched.size());
    
  • 2020-12-16 02:40

    From the Javadoc:

    Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; if you do not synchronize access to that state, you have a data race and therefore your code is broken, but if you do synchronize access to that state, you risk having contention undermine the parallelism you are seeking to benefit from. The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.

    The problem here is that if you access mutable state, you lose on two fronts:

    • Safety, because you need synchronization, which the Stream API tries to minimize.
    • Performance, because the required synchronization costs you (in your example, if you use a ConcurrentHashMap, that has a cost).
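
    The trade-off can be sketched as follows, with made-up names (CountDemo and its two methods). The first version keeps shared state and stays correct only because AtomicInteger is thread-safe, at the price of every thread contending on one counter; the second restructures the pipeline so that no shared state exists.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

class CountDemo {
    // Stateful lambda: correct only because AtomicInteger is thread-safe,
    // and every matching element contends on the same counter.
    static int countWithSharedCounter(int n) {
        AtomicInteger counter = new AtomicInteger();
        IntStream.range(0, n).parallel().forEach(e -> {
            if (e >= 100) {
                counter.incrementAndGet();
            }
        });
        return counter.get();
    }

    // Stateless pipeline: each thread counts its own chunk and the
    // partial counts are combined by the stream itself.
    static long countWithReduction(int n) {
        return IntStream.range(0, n).parallel().filter(e -> e >= 100).count();
    }

    public static void main(String[] args) {
        System.out.println(countWithSharedCounter(10000)); // prints 9900
        System.out.println(countWithReduction(10000));     // prints 9900
    }
}
```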

    Now, a few points about your example:

    • If you want a multithreaded stream, you need to call parallelStream(), as in myThings.parallelStream(). As it stands, the forEach method you call on the collection is the one inherited from java.lang.Iterable, which is a plain sequential for-each.
    • You use a HashMap as a static member and you mutate it. HashMap is not thread-safe; if several threads write to it, you need a ConcurrentHashMap instead.

    Inside a lambda passed to a stream operation, you must not mutate the source of the stream:

    myThings.stream().forEach(thing -> myThings.remove(thing));
    

    This may appear to work (though I suspect it will throw a ConcurrentModificationException), but this one will most likely not work at all:

    myThings.parallelStream().forEach(thing -> myThings.remove(thing));
    

    That's because ArrayList is not thread-safe.

    If you use a synchronized view (Collections.synchronizedList), you take a performance hit because every access is synchronized.
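
    A non-interfering alternative can be sketched like this. The names (RemoveDemo, removeInPlace, keepFiltered) and the "x" prefix predicate are invented for illustration. Either mutate the list through its own removeIf method, with no stream involved, or leave the source untouched and collect the survivors into a new list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class RemoveDemo {
    // In-place removal without a stream: the list handles its own
    // structural modification, so nothing iterates it concurrently.
    static List<String> removeInPlace(List<String> things) {
        things.removeIf(t -> t.startsWith("x"));
        return things;
    }

    // Stream version: the source is only read; the result is a new list.
    static List<String> keepFiltered(List<String> things) {
        return things.stream()
                     .filter(t -> !t.startsWith("x"))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> source = new ArrayList<>(Arrays.asList("a", "x1", "b", "x2"));
        System.out.println(keepFiltered(source));  // prints [a, b]
        System.out.println(source);                // prints [a, x1, b, x2]
        System.out.println(removeInPlace(source)); // prints [a, b]
    }
}
```

    If the goal really is to remove every element, as in the snippets above, myThings.clear() does it directly.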

    In your example, you would rather use:

    sortOrderCache = myThings.stream()
                             .collect(Collectors.toMap(
                               Thing::getId, Thing::getSortOrder));
    codeNameCache = myThings.stream()
                            .collect(Collectors.toMap(
                              Thing::getId, Thing::getCodeName));
    

    The collector (here toMap) does the work you were doing by hand. On a sequential stream it simply accumulates the elements one after another. On a parallel stream the source may be split across several threads: each thread accumulates into its own container, and the partial results are then merged, so no shared mutable state is needed. Note that toMap throws an IllegalStateException if two elements produce the same key.
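
    What a collector does on a parallel stream can be spelled out with the three-argument form of Stream.collect. This is a sketch with made-up names (CollectDemo, matched), not the code of any particular built-in collector: a supplier creates a private container per chunk, an accumulator fills it, and a combiner merges the partial containers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

class CollectDemo {
    static List<Integer> matched(int n) {
        return IntStream.range(0, n).parallel().boxed()
                .filter(e -> e >= 100)
                .collect(
                    ArrayList::new,      // supplier: a fresh container for each chunk
                    ArrayList::add,      // accumulator: adds to that chunk's own container
                    ArrayList::addAll);  // combiner: merges two partial containers
    }

    public static void main(String[] args) {
        List<Integer> result = matched(10000);
        System.out.println(result.size()); // prints 9900
        System.out.println(result.get(0)); // prints 100
    }
}
```

    No container is ever shared between threads while it is being filled, which is why a plain ArrayList is safe here even though the stream is parallel.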

    By the way, you could drop codeNameCache and sortOrderCache altogether and simply store a single id -> Thing mapping.
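
    A single id -> Thing map could look like the sketch below. The Thing class here is a hypothetical stand-in: the getter names come from the snippets above, but the field types (int id, int sortOrder, String codeName) are assumptions, as is the indexById helper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class ThingCacheDemo {
    // Hypothetical stand-in for the asker's Thing class; field types are assumed.
    static final class Thing {
        private final int id;
        private final int sortOrder;
        private final String codeName;

        Thing(int id, int sortOrder, String codeName) {
            this.id = id;
            this.sortOrder = sortOrder;
            this.codeName = codeName;
        }

        int getId() { return id; }
        int getSortOrder() { return sortOrder; }
        String getCodeName() { return codeName; }
    }

    // One map replaces both caches; toMap throws IllegalStateException
    // if two things share an id.
    static Map<Integer, Thing> indexById(List<Thing> things) {
        return things.stream()
                     .collect(Collectors.toMap(Thing::getId, Function.identity()));
    }

    public static void main(String[] args) {
        Map<Integer, Thing> byId = indexById(Arrays.asList(
                new Thing(1, 10, "alpha"),
                new Thing(2, 20, "beta")));
        System.out.println(byId.get(2).getSortOrder()); // prints 20
        System.out.println(byId.get(1).getCodeName());  // prints alpha
    }
}
```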
