Question
This takes around 1 second:
(1 to 1000000).map(_+3)
while this gives java.lang.OutOfMemoryError: Java heap space:
(1 to 1000000).par.map(_+3)
EDIT:
I have a standard Scala 2.9.2 configuration and I am typing this at the Scala REPL prompt. In the scala launcher script I can see [ -n "$JAVA_OPTS" ] || JAVA_OPTS="-Xmx256M -Xms32M", and I don't have JAVA_OPTS set in my environment, so the REPL runs with the default 256 MB heap cap.
1 million integers at 8 bytes each = 8 MB; creating the collection twice = 16 MB, which should fit comfortably in a 256 MB heap.
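As a sanity check (not in the original post), the effective heap cap can be confirmed from inside the REPL; maxMemory reports roughly the -Xmx limit, though the exact figure varies by JVM:

// Hedged check: confirm the heap cap the REPL is actually running with.
val maxHeapMB = Runtime.getRuntime.maxMemory / (1024 * 1024)
println("max heap: " + maxHeapMB + " MB")  // expect ~256 with the defaults above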
Answer 1:
It definitely seems related to the JVM memory options and to the memory required to store a parallel collection. For example:
scala> (1 to 1000000).par.map(_+3)
ends up with an OutOfMemoryError the third time I try to evaluate it, while
scala> (1 to 1000000).par.map(_+3).seq
never fails. The issue is not the computation; it's the storage of the parallel collection.
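A rough way to see the difference in retained memory is the probe below (a sketch, not from the original answer; System.gc() is only a hint, so the readings are approximate):

def usedMB: Long = {
  System.gc()  // hint only; the JVM may ignore it
  val rt = Runtime.getRuntime
  (rt.totalMemory - rt.freeMemory) / (1024 * 1024)
}

val before = usedMB
val kept = (1 to 1000000).par.map(_ + 3).seq  // keep only the sequential result
val after = usedMB
println("retained roughly " + (after - before) + " MB")

Running the same probe while keeping the parallel result (dropping the trailing .seq) should report a noticeably larger figure.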
Answer 2:
Several reasons for the failure:
- Parallel collections are not specialized, so the elements get boxed. This means you can't just multiply the number of elements by 8 to get the memory usage.
- Using map means that the range is converted into a vector. For parallel vectors an efficient concatenation has not been implemented yet, so merging the intermediate vectors produced by different processors proceeds by copying, requiring more memory; this will be addressed in future releases. (See the REPL transcript after this list.)
- The REPL stores previous results: the object evaluated on each line remains in memory.
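To illustrate the point about map, a short REPL transcript (the printed types may differ slightly across Scala versions): mapping a Range yields a Vector, and mapping a parallel range yields a ParVector:

scala> (1 to 5).map(_+3)
res0: scala.collection.immutable.IndexedSeq[Int] = Vector(4, 5, 6, 7, 8)

scala> (1 to 5).par.map(_+3)
res1: scala.collection.parallel.immutable.ParSeq[Int] = ParVector(4, 5, 6, 7, 8)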
Answer 3:
There are two issues here: the amount of memory required to store a parallel collection, and the amount of memory required to 'pass through' a parallel collection.
The difference can be seen between these two lines:
(1 to 1000000).map(_+3).toList
(1 to 1000000).par.map(_+3).toList
Remember that the REPL stores the evaluated expressions. On my REPL, I can execute both of these 7 times before I run out of memory. Passing through the parallel execution uses extra memory temporarily, but once the toList is executed, that extra usage is garbage collected.
(1 to 100000).par.map(_+3)
returns a ParSeq[Int] (in this case a ParVector), which takes up more space than a normal Vector. I can execute this one 4 times before I run out of memory, whereas I can execute this:
(1 to 100000).map(_+3)
11 times before I run out of memory. So parallel collections, if you keep them around, will take up more space.
As a workaround, you can transform them into simpler collections like a List before you return them.
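For example, a minimal sketch of that workaround (plusThree is a hypothetical helper, not from the original answer):

// Materialize the result as a plain List so the bulkier ParVector
// becomes garbage as soon as the method returns.
def plusThree(n: Int): List[Int] = (1 to n).par.map(_ + 3).toList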
As for why parallel collections take up so much space and why they keep references to so many things, I don't know, but I suspect views[*]; if you think it's a problem, raise an issue for it.
[*] without any real evidence.
Answer 4:
I had the same problem, but using a thread pool seems to get rid of it for me:
import java.util.concurrent.{Executors, ThreadPoolExecutor}
import scala.collection.parallel.ThreadPoolTaskSupport

val threadPool = Executors.newFixedThreadPool(4)  // fixed pool of 4 threads
val quadsMinPar = quadsMin.par
quadsMinPar.tasksupport = new ThreadPoolTaskSupport(threadPool.asInstanceOf[ThreadPoolExecutor])
ForkJoin for large collections might be creating too many threads.
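On Scala 2.10+, a similar way to cap the worker count is to give the collection a fork/join pool with a fixed parallelism level (a sketch, not from the original answer; quadsMin again stands in for whatever sequence is being processed):

import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

val capped = quadsMin.par
capped.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))  // at most 4 workers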
Source: https://stackoverflow.com/questions/10847628/why-do-scala-parallel-collections-sometimes-cause-an-outofmemoryerror