Why do Scala parallel collections sometimes cause an OutOfMemoryError?


Question


This takes around 1 second

(1 to 1000000).map(_+3)

While this gives java.lang.OutOfMemoryError: Java heap space

(1 to 1000000).par.map(_+3)

EDIT:

I have a standard Scala 2.9.2 configuration, and I am typing this at the Scala REPL prompt. In the bash launcher script I can see [ -n "$JAVA_OPTS" ] || JAVA_OPTS="-Xmx256M -Xms32M", and I don't have JAVA_OPTS set in my environment, so the default 256 MB heap cap applies.

1 million integers = 8 MB; creating the list twice = 16 MB, which should fit easily within 256 MB.
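
For reference, you can confirm the effective heap cap from inside the REPL with the standard JDK Runtime API:

  // Prints the JVM's maximum heap size; with the launcher default
  // above, this should report roughly 256 MB.
  println("max heap: " + Runtime.getRuntime.maxMemory / (1024 * 1024) + " MB")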


Answer 1:


It definitely seems related to the JVM memory options and to the memory required to store a parallel collection. For example:

scala> (1 to 1000000).par.map(_+3)

ends up with an OutOfMemoryError by the third time I evaluate it, while

scala> (1 to 1000000).par.map(_+3).seq

never fails. The issue is not the computation, it's the storage of the parallel collection.
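
A minimal sketch of that pattern: do the mapping in parallel but keep only the sequential result, so the intermediate parallel collection can be garbage collected:

  // Convert back to a sequential collection before storing the result;
  // only the plain Seq is retained.
  val result: Seq[Int] = (1 to 1000000).par.map(_ + 3).seq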




Answer 2:


Several reasons for the failure:

  1. Parallel collections are not specialized, so the objects get boxed. This means that you can't multiply the number of elements by 8 to get the memory usage (see the rough estimate after this list).
  2. Using map means that the range is converted into a vector. For parallel vectors, an efficient concatenation has not been implemented yet, so merging the intermediate vectors produced by different processors proceeds by copying, which requires more memory. This will be addressed in future releases.
  3. The REPL stores previous results, so the object evaluated on each line remains in memory.
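
To put a rough number on point 1, here is a back-of-the-envelope estimate (the per-object costs are assumptions for a typical 64-bit JVM with compressed oops, not measurements):

  // Boxed elements cost far more than the 4 bytes of a primitive Int:
  // assume ~16 bytes per java.lang.Integer object plus a 4-byte
  // reference in the backing array.
  val elements = 1000000L
  val bytesPerBox = 16L
  val bytesPerRef = 4L
  val approxBytes = elements * (bytesPerBox + bytesPerRef)
  println("roughly " + approxBytes / (1024 * 1024) + " MB for the boxed elements alone")

Copying during the merge step (point 2) can then multiply that footprint further.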



Answer 3:


There are two issues here: the amount of memory required to store a parallel collection, and the amount of memory required to 'pass through' a parallel collection.

The difference can be seen between these two lines:

(1 to 1000000).map(_+3).toList
(1 to 1000000).par.map(_+3).toList

The REPL stores the evaluated expressions, remember. On my REPL, I can execute both of these 7 times before I run out of memory. Passing through the parallel execution uses extra memory temporarily, but once the toList is executed, that extra usage is garbage collected.

(1 to 100000).par.map(_+3)

returns a ParSeq[Int] (in this case a ParVector), which takes up more space than a normal Vector. This one I can execute 4 times before I run out of memory, whereas I can execute this:

(1 to 100000).map(_+3)

11 times before I run out of memory. So parallel collections, if you keep them around, will take up more space.

As a workaround, you can transform them into simpler collections like a List before you return them.
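
A minimal sketch of that workaround (the helper name is hypothetical): the parallel collection lives only inside the method, and callers only ever hold the plain List:

  // Hypothetical helper: parallelism stays an implementation detail,
  // and only the sequential List escapes.
  def plusThreePar(xs: Range): List[Int] =
    xs.par.map(_ + 3).toList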

As for why so much space is taken up by parallel collections and why they keep references to so many things, I don't know, but I suspect views [*]; if you think it's a problem, raise an issue for it.

[*] without any real evidence.




Answer 4:


I had the same problem, but using a ThreadPool seems to get rid of it for me:

  import java.util.concurrent.{Executors, ThreadPoolExecutor}
  import scala.collection.parallel.ThreadPoolTaskSupport

  // Use a fixed pool of 4 threads instead of the default ForkJoin pool.
  val threadPool = Executors.newFixedThreadPool(4)
  val quadsMinPar = quadsMin.par
  quadsMinPar.tasksupport = new ThreadPoolTaskSupport(threadPool.asInstanceOf[ThreadPoolExecutor])

ForkJoin for large collections might be creating too many threads.
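
Alternatively, assuming Scala 2.10 or 2.11, you can keep ForkJoin but cap its parallelism explicitly with ForkJoinTaskSupport (a sketch, not part of the original answer):

  import scala.collection.parallel.ForkJoinTaskSupport
  import scala.concurrent.forkjoin.ForkJoinPool

  // Cap the ForkJoin pool at 4 worker threads.
  val capped = (1 to 1000000).par
  capped.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))
  capped.map(_ + 3)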



Source: https://stackoverflow.com/questions/10847628/why-do-scala-parallel-collections-sometimes-cause-an-outofmemoryerror
