Manually calling spark's garbage collection from pyspark


Question


I have been running a workflow on roughly 3 million records x 15 columns, all strings, on my 4-core 16 GB machine using PySpark 1.5 in local mode. I have noticed that if I run the same workflow again without first restarting Spark, memory runs out and I get Out of Memory exceptions.

Since all my caches sum up to about 1 GB, I thought the problem lay with garbage collection. I was able to run the Python garbage collector manually by calling:

import gc
collected = gc.collect()
print "Garbage collector: collected %d objects." % collected

This has helped a little.

I have played with the settings of Spark's GC according to this article, and have tried compressing the RDD and changing the serializer to Kryo. This slowed down the processing and did not help much with the memory.
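For reference, a minimal sketch of how those two attempts might be configured; the configuration keys (spark.serializer, spark.rdd.compress) are standard Spark settings, but the master and app name below are placeholders, not the setup from the question:

from pyspark import SparkConf, SparkContext

# Kryo serialization plus RDD compression, as attempted above.
conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("gc-tuning-sketch")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.rdd.compress", "true"))
sc = SparkContext(conf=conf)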

Since I know exactly when I have spare CPU cycles to call the GC, it would help my situation to know how to call it manually in the JVM.


Answer 1:


You never have to call the GC manually. If you got an OOM exception, it's because there is no more memory available. You should look for memory leaks, i.e. references you keep in your code. If you release those references, the JVM will free space when needed.
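As a hedged illustration of what "releasing references" can look like in PySpark (not from the original answer; rdd and transform below are hypothetical placeholders):

cached = rdd.map(transform).cache()   # rdd and transform are placeholders
result = cached.count()               # some action that used the cached data
cached.unpersist()                    # release the blocks held by the block manager
del cached                            # drop the Python-side reference as well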




Answer 2:


I believe this will trigger a GC (hint) in the JVM:

spark.sparkContext._jvm.System.gc()
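As a self-contained sketch (assuming Spark 2.x, where SparkSession is available; on Spark 1.5 the same _jvm attribute lives on the SparkContext, i.e. sc._jvm.System.gc()):

from pyspark.sql import SparkSession

# In local mode the driver JVM also hosts the executors, so this hint reaches
# the whole application. System.gc() is only a request; the JVM may ignore it.
spark = SparkSession.builder.master("local[4]").appName("gc-hint").getOrCreate()
spark.sparkContext._jvm.System.gc()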

See also: How to force garbage collection in Java?

and: Java: How do you really force a GC using JVMTI's ForceGargabeCollection?




Answer 3:


This is not yet possible; there are some tickets about executing a "management task" on all executors:

  • https://issues.apache.org/jira/browse/SPARK-650
  • https://issues.apache.org/jira/browse/SPARK-636

But they are not completed yet.

You can try to call the JVM GC when executing worker code, and this will work; for example, when doing an RDD map. But I am sure that with the right tuning you can get rid of the OOM.
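One caveat for PySpark specifically: the function you pass to map or mapPartitions runs in a Python worker process, so from there you can only trigger the Python garbage collector, not the executor JVM's. A hedged sketch of that variant (the rdd name is hypothetical):

import gc

def process_partition(partition):
    # Stream the partition through, then ask the Python worker's GC to run.
    # This frees Python objects only; the executor JVM heap is untouched.
    for row in partition:
        yield row
    gc.collect()

# Hypothetical usage on an existing RDD:
# rdd.mapPartitions(process_partition).count()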

The most important setting is the fraction of the Java heap given to execution and storage (the RDD cache): spark.memory.fraction. Sometimes it's better to set it to a very low value (such as 0.1), sometimes to increase it.

More info at https://spark.apache.org/docs/2.2.0/tuning.html#memory-management-overview
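For completeness, a short sketch of setting that fraction (assuming Spark 1.6+, where the unified spark.memory.fraction key applies; on Spark 1.5 the corresponding legacy knob is spark.storage.memoryFraction). The 0.1 value is just the example figure from the answer, not a recommendation:

from pyspark import SparkConf

conf = SparkConf().set("spark.memory.fraction", "0.1")
# or equivalently when submitting the job:
# spark-submit --conf spark.memory.fraction=0.1 my_job.py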



Source: https://stackoverflow.com/questions/33689536/manually-calling-sparks-garbage-collection-from-pyspark
