java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0

Anonymous (unverified), submitted 2019-12-03 03:03:02

Question:

I'm invoking Pyspark with Spark 2.0 in local mode with the following command:

pyspark --executor-memory 4g --driver-memory 4g 

The input DataFrame is read from a TSV file and has about 580K rows x 28 columns. I perform a few operations on the DataFrame and then try to export it to a TSV file, at which point I get this error:

df.coalesce(1).write.save("sample.tsv",format = "csv",header = 'true', delimiter = '\t') 

Any pointers on how to get rid of this error? I can easily display the DataFrame or count its rows.

The output DataFrame has 3100 rows and 23 columns.

Error:

Job aborted due to stage failure: Task 0 in stage 70.0 failed 1 times, most recent failure: Lost task 0.0 in stage 70.0 (TID 1073, localhost): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0
    at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:129)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.fetchNextRow(WindowExec.scala:300)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.<init>(WindowExec.scala:309)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:289)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:288)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    ... 8 more
Driver stacktrace:

Answer 1:

The problem for me was indeed coalesce(). Instead of exporting with coalesce(), I first wrote the data out as Parquet using df.write.parquet("testP"), then read that file back in and exported it with coalesce(1).

Hopefully it works for you as well.



Answer 2:

I believe the cause of this problem is coalesce(): although it avoids a full shuffle (as repartition would do), it still has to shrink the data into the requested number of partitions.

Here, you are requesting all the data to fit into one partition, thus one task (and only one task) has to work with all the data, which may cause its container to suffer from memory limitations.

So, either ask for more than one partition, or avoid coalesce() in this case.


Otherwise, you could try the solutions in the links below for increasing your memory configuration:

  1. Spark java.lang.OutOfMemoryError: Java heap space
  2. Spark runs out of memory when grouping by key


Answer 3:

In my case the driver had less memory than the workers. The issue was resolved by making the driver larger.
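For example, the launch command from the question could be adjusted to give the driver more memory than the executors (the 8g figure is illustrative, not a tested value for this dataset):

```shell
# In local mode everything runs in the driver JVM, so --driver-memory is
# the setting that actually matters; raise it above the executor figure.
pyspark --driver-memory 8g --executor-memory 4g
```

Note that in local mode the driver and the "executor" share one JVM, so --driver-memory is the flag that controls the heap the single coalesce(1) task runs in.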


