Spark driver disassociated and removed by the master

匿名 (未验证) 提交于 2019-12-03 08:41:19

问题:

I have a cluster made by two slaves and one master and set up and I submit a jar (scala) to the spark master (192.168.1.64):

spark-submit --master spark://spark-master:7077 --class tests.elements target/scala-2.10/zzz-project_2.10-1.0.jar 

After quite sometime running just fine it stops abruptly with the last lines on the terminal being

... 15/08/19 17:45:24 INFO scheduler.TaskSchedulerImpl: Adding task set 411292.0 with 6 tasks 15/08/19 17:45:24 WARN scheduler.TaskSetManager: Stage 411292 contains a task of very large size (2762 KB). The maximum recommended task size is 100 KB. 15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 411292.0 (TID 1832, 192.168.1.64, PROCESS_LOCAL, 2828792 bytes) 15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 411292.0 (TID 1833, 192.168.1.62, PROCESS_LOCAL, 2310009 bytes) 15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 411292.0 (TID 1834, 192.168.1.64, PROCESS_LOCAL, 2669188 bytes) 15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 411292.0 (TID 1835, 192.168.1.62, PROCESS_LOCAL, 2295676 bytes) 15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 411292.0 (TID 1836, 192.168.1.64, PROCESS_LOCAL, 2847786 bytes) 15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 411292.0 (TID 1837, 192.168.1.64, PROCESS_LOCAL, 2913528 bytes) Killed 

and the error occurring at the master log is the following:

... 15/08/19 16:09:49 INFO master.Master: Launching executor app-20150819160949-0001/0 on worker worker-20150819160925-192.168.1.64-51640 15/08/19 16:09:49 INFO master.Master: Launching executor app-20150819160949-0001/1 on worker worker-20150819160938-192.168.1.62-38007 15/08/19 16:15:44 INFO master.Master: akka.tcp://sparkDriver@192.168.1.64:46823 got disassociated, removing it. 15/08/19 16:15:44 INFO master.Master: Removing app app-20150819160949-0001 15/08/19 16:15:44 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/08/19 16:15:44 WARN master.Master: Application testPageRank is still in progress, it may be terminated abnormally. ... 

Both workers have in their logs something like this

... 15/08/19 16:15:49 INFO worker.Worker: Executor app-20150819160949-0001/0 finished with state EXITED message Command exited with code 1 exitStatus 1 15/08/19 16:15:50 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@192.168.1.64:54799] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 

and

... 15/08/19 16:15:43 INFO worker.Worker: Executor app-20150819160949-0001/1 finished with state EXITED message Command exited with code 1 exitStatus 1 15/08/19 16:15:43 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@192.168.1.62:53325] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 

respectively. The work/app files contain something like this

... 15/08/19 16:15:41 INFO executor.Executor: Finished task 1.0 in stage 387758.0 (TID 1803). 1911 bytes result sent to driver 15/08/19 16:15:41 INFO executor.Executor: Finished task 4.0 in stage 387758.0 (TID 1806). 1911 bytes result sent to driver 15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_5 locally 15/08/19 16:15:41 INFO executor.Executor: Finished task 5.0 in stage 387758.0 (TID 1807). 1911 bytes result sent to driver 15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_3 locally 15/08/19 16:15:41 INFO executor.Executor: Finished task 3.0 in stage 387758.0 (TID 1805). 1911 bytes result sent to driver 15/08/19 16:15:44 ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.1.64:46823 disassociated! Shutting down. 15/08/19 16:15:44 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/08/19 16:15:45 INFO storage.DiskBlockManager: Shutdown hook called 15/08/19 16:15:46 INFO util.Utils: Shutdown hook called 

and

... 15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_0 locally 15/08/19 16:15:41 INFO executor.Executor: Finished task 2.0 in stage 387758.0 (TID 1804). 1911 bytes result sent to driver 15/08/19 16:15:41 INFO executor.Executor: Finished task 0.0 in stage 387758.0 (TID 1802). 1911 bytes result sent to driver 15/08/19 16:15:42 ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.1.64:46823 disassociated! Shutting down. 15/08/19 16:15:42 INFO storage.DiskBlockManager: Shutdown hook called 15/08/19 16:15:42 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/08/19 16:15:42 INFO util.Utils: Shutdown hook called 

respectively. There seem to be no other error in hdfs or spark.

I am suspecting that the error lies in the master log, the third line (15/08/19 16:15:44 INFO master.Master: akka.tcp://sparkDriver@192.168.1.64:46823 got disassociated, removing it.) but I can't figure out why. I tried changing the spark.akka.heartbeat.interval to 100 as suggested in some posts but no luck. Anyone would know why it happens and how to solve this? Thanks so much.

回答1:

As mentioned in a very similar question here WARN ReliableDeliverySupervisor: Association with remote system has failed, address is now gated for [5000] ms. Reason: [Disassociated]

The problem is likely to be the lack of memory. Adding more memory (or in that case more nodes) should solve the problem.

(Alternately, needing less memory should work too of course).



标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!