java.lang.OutOfMemoryError in rdd.collect() when all memory settings are set to huge

Question


I run the following Python script with spark-submit:

r = rdd.map(list).groupBy(lambda x: x[0]).map(lambda x: x[1]).map(list)
r_labeled = r.map(f_0).flatMap(f_1)
r_labeled.map(lambda x: x[3]).collect()

It fails with java.lang.OutOfMemoryError, specifically on the collect() action in the last line:

java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
17/11/08 08:27:31 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 6,5,main]
    ... (same java.lang.OutOfMemoryError stack trace as above)
17/11/08 08:27:31 INFO SparkContext: Invoking stop() from shutdown hook
17/11/08 08:27:31 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 6, localhost, executor driver): java.lang.OutOfMemoryError
    ... (same stack trace as above)

The message says OutOfMemoryError and nothing else. Is it about the heap, garbage collection, or something else? I don't know.

Anyway, I tried setting every memory-related option to a huge value:

spark.driver.maxResultSize = 0 # no limit
spark.driver.memory = 150g
spark.executor.memory = 150g
spark.worker.memory = 150g

(And the server has 157g physical memory available.)
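
For reference, settings like these can be supplied either in spark-defaults.conf or on the spark-submit command line; a sketch with a placeholder script name, mirroring the values above:

spark-submit \
  --driver-memory 150g \
  --executor-memory 150g \
  --conf spark.driver.maxResultSize=0 \
  my_script.py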

The same error is still there.

Then I reduced the input data a little, and the code passed perfectly every time. In fact, the data returned by collect() is about 1.8g, far smaller than the 157g of physical memory.

Now I am sure the error is not caused by the code itself, and physical memory is not the limiting factor. It seems as if there is a threshold on the size of the input data, and exceeding it causes the out-of-memory error.

So how can I raise this threshold so that I can handle bigger input data without memory errors? Any settings?

Thanks.

========== follow up ============

According to this, the error is related to the Java serializer and large objects in a map transformation. I did use a large object in my code. I am wondering how to get the Java serializer to accommodate large objects.
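
For reference, a commonly suggested serializer-side mitigation (not verified for this case) is to switch to Kryo serialization and raise its buffer limit; note that a single serialized object is still bound by the JVM's ~2 GB byte-array limit, so this only helps if each object stays below that size:

spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=1g \
  my_script.py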


Answer 1:


First of all, it makes sense that you only hit a problem when you call the collect method. Spark is lazy: it does absolutely nothing until you send data to the driver (collect, reduce, count, ...) or to disk (write, save, ...).
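
A minimal illustration of that laziness (assuming an existing SparkContext sc; the data is hypothetical):

rdd = sc.parallelize(range(10))      # no computation happens yet
doubled = rdd.map(lambda x: x * 2)   # transformation: still nothing runs
result = doubled.collect()           # action: the job executes and data is sent to the driver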

Then it seems that you get an out-of-memory exception on an executor. What I understand from the stack trace is that your groupBy is creating an array whose size exceeds the maximum array capacity (Integer.MAX_VALUE - 5, according to this). Could it be that a given key appears more than 2 billion times in your dataset?
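
One way to check, sketched under the assumption that the key is the first field x[0] as in the code above, is to count records per key without grouping them:

key_counts = (rdd.map(list)
                 .map(lambda x: (x[0], 1))
                 .reduceByKey(lambda a, b: a + b))
print(key_counts.top(10, key=lambda kv: kv[1]))   # the ten largest groups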

In any case, I am not sure I understand what you are trying to do, but if you can, try to replace the groupBy with a reduce operation, which will put less stress on memory.
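
For example, a sketch of that idea; combine is a hypothetical placeholder for whatever per-group logic f_0/f_1 actually perform, and it must be associative and commutative for the result to be correct:

r = (rdd.map(list)
        .map(lambda x: (x[0], x))                  # key by the first field, as in the question
        .reduceByKey(lambda a, b: combine(a, b)))  # merge values pairwise instead of
                                                   # materialising each whole group in memory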

Finally, you gave 150g to the driver and 150g to each executor, although you only have about 150g in total for everything. I do not know who got what in your case. Try to share the memory reasonably and tell us what happens.

Hope this helps.



Source: https://stackoverflow.com/questions/47164588/java-lang-outofmemoryerror-in-rdd-collect-when-all-memory-setting-are-set-to-h
