Question
I am running a Spark application that processes multiple sets of data points; some of these sets need to be processed sequentially. When running the application for small sets of data points (ca. 100), everything works fine. But in some cases, the sets will have a size of ca. 10,000 data points, and those cause the worker to crash with the following stack trace:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 4 times, most recent failure: Lost task 0.3 in stage 26.0 (TID 36, 10.40.98.10, executor 1): java.io.FileNotFoundException: /tmp/spark-5198d746-6501-4c4d-bb1c-82479d5fd48f/executor-a1d76cc1-a3eb-4147-b73b-29742cfd652d/blockmgr-d2c5371b-1860-4d8b-89ce-0b60a79fa394/3a/temp_shuffle_94d136c9-4dc4-439e-90bc-58b18742011c (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:102)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:115)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:235)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have checked all log files after multiple instances of this error, but did not find any other error messages.
Searching the internet for this problem, I have found two potential causes that do not seem to be applicable to my situation:
- The user running the Spark process does not have read/write permission in the /tmp/ directory. Seeing as the error occurs only for larger datasets (instead of always), I do not expect this to be the problem.
- The /tmp/ directory does not have enough space for shuffle files (or other temporary Spark files). The /tmp/ directory on my system has about 45GB available, and the amount of data in a single data point (< 1KB) means that this is also probably not the case.
I have been flailing at this problem for a couple of hours, trying to find work-arounds and possible causes:
- I have tried reducing the cluster (which is normally two machines) to a single worker, running on the same machine as the driver, in the hope that this would eliminate the need for shuffles and thus prevent this error. This did not work; the error occurs in exactly the same way.
- I have isolated the problem to an operation that processes a dataset sequentially through a tail-recursive method.
What is causing this problem? How can I go about determining the cause myself?
Answer 1:
The problem turns out to be a stack overflow (ha!) occurring on the worker.
On a hunch, I rewrote the operation to be performed entirely on the driver (effectively disabling Spark functionality). When I ran this code, the system still crashed, but now displayed a StackOverflowError. Contrary to what I previously believed, apparently tail-recursive methods can definitely cause a stack overflow just like any other form of recursion. After rewriting the method to no longer use recursion, the problem disappeared.
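To illustrate the kind of rewrite involved (the actual processing method is not shown in the question, so the object and functions below are hypothetical stand-ins), a recursive traversal whose self-call is not in tail position grows the stack with every element, whereas a fold does the same work with constant stack usage:

    import scala.annotation.tailrec

    object Processing {
      // Hypothetical stand-in for the sequential per-set operation.
      // The recursive call is wrapped in '+', so it is not in tail position:
      // every element adds a stack frame, and a large enough set can
      // overflow the thread's stack.
      def processNotTail(points: List[Double]): Double = points match {
        case Nil          => 0.0
        case head :: tail => head + processNotTail(tail)
      }

      // Genuinely tail-recursive variant: @tailrec makes the compiler verify
      // that the self-call is in tail position and compile it into a loop.
      @tailrec
      def processTail(points: List[Double], acc: Double = 0.0): Double = points match {
        case Nil          => acc
        case head :: tail => processTail(tail, acc + head)
      }

      // Non-recursive rewrite of the same computation: constant stack usage
      // regardless of how many data points are in the set.
      def processIterative(points: List[Double]): Double =
        points.foldLeft(0.0)(_ + _)
    }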
A stack overflow is probably not the only problem that can produce the original FileNotFoundException, but making a temporary code change which pulls the operation to the driver seems to be a good way to determine the actual cause of the problem.
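Here is a minimal sketch of that diagnostic step, assuming an RDD of data-point sets and the hypothetical Processing.processNotTail from above: collecting the sets and applying the same function on the driver lets the underlying StackOverflowError (if that is indeed the cause) surface with a readable stack trace, where the distributed version only reports the lost shuffle file.

    import org.apache.spark.sql.SparkSession

    object DriverSideDebug {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("driver-side-debug").getOrCreate()
        val sc    = spark.sparkContext

        // Stand-in for the real input: a few sets of ~10,000 data points each.
        val sets = sc.parallelize(Seq.fill(4)(List.tabulate(10000)(_.toDouble)))

        // Distributed version: if the per-set operation blows the stack inside a
        // task, the failure may only show up as a failed stage and a missing
        // temp_shuffle file on the worker.
        // val results = sets.map(Processing.processNotTail).collect()

        // Temporary driver-only version: if the recursion is deep enough to
        // overflow the stack, the same operation now fails on the driver with
        // a plain StackOverflowError instead.
        val results = sets.collect().map(Processing.processNotTail)

        println(results.mkString(", "))
        spark.stop()
      }
    }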
Source: https://stackoverflow.com/questions/46825569/spark-worker-throws-filenotfoundexception-on-temporary-shuffle-files