org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout

Submitted by Anonymous (unverified) on 2019-12-03 09:05:37

Question:

I am getting the error below from the container when submitting a Spark application to YARN. The Hadoop (2.7.3) / Spark (2.1) environment runs in pseudo-distributed mode on a single-node cluster. The application works perfectly in local mode, but when I try to verify it in cluster mode with YARN as the resource manager, I hit this roadblock. I am new to this world, hence looking for help.

--- Application logs

2017-04-11 07:13:28 INFO  Client:58 - Submitting application 1 to ResourceManager
2017-04-11 07:13:28 INFO  YarnClientImpl:174 - Submitted application application_1491909036583_0001 to ResourceManager at /0.0.0.0:8032
2017-04-11 07:13:29 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:13:29 INFO  Client:58 -
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1491909208425
     final status: UNDEFINED
     tracking URL: http://ip-xxx.xx.xx.xxx:8088/proxy/application_1491909036583_0001/
     user: xxxx
2017-04-11 07:13:30 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:13:31 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:13:32 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:17:37 INFO  Client:58 - Application report for application_1491909036583_0001 (state: FAILED)
2017-04-11 07:17:37 INFO  Client:58 -
     client token: N/A
     diagnostics: Application application_1491909036583_0001 failed 2 times due to AM Container for appattempt_1491909036583_0001_000002 exited with exitCode: 10
For more detailed output, check application tracking page: http://"hostname":8088/cluster/app/application_1491909036583_0001
Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1491909036583_0001_02_000001
Exit code: 10
Stack trace: ExitCodeException exitCode=10:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
    at org.apache.hadoop.util.Shell.run(Shell.java:479)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

--- Container logs

2017-04-11 07:13:30 INFO  ApplicationMaster:47 - Registered signal handlers for [TERM, HUP, INT]
2017-04-11 07:13:31 INFO  ApplicationMaster:59 - ApplicationAttemptId: appattempt_1491909036583_0001_000001
2017-04-11 07:13:32 INFO  SecurityManager:59 - Changing view acls to: root,xxxx
2017-04-11 07:13:32 INFO  SecurityManager:59 - Changing modify acls to: root,xxxx
2017-04-11 07:13:32 INFO  SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, xxxx); users with modify permissions: Set(root, xxxx)
2017-04-11 07:13:32 INFO  Slf4jLogger:80 - Slf4jLogger started
2017-04-11 07:13:32 INFO  Remoting:74 - Starting remoting
2017-04-11 07:13:32 INFO  Remoting:74 - Remoting started; listening on addresses :[akka.tcp://sparkYarnAM@xxx.xx.xx.xxx:45446]
2017-04-11 07:13:32 INFO  Remoting:74 - Remoting now listens on addresses: [akka.tcp://sparkYarnAM@xxx.xx.xx.xxx:45446]
2017-04-11 07:13:32 INFO  Utils:59 - Successfully started service 'sparkYarnAM' on port 45446.
2017-04-11 07:13:32 INFO  ApplicationMaster:59 - Waiting for Spark driver to be reachable.
2017-04-11 07:13:32 INFO  ApplicationMaster:59 - Driver now available: xxx.xx.xx.xxx:47503
2017-04-11 07:15:32 ERROR ApplicationMaster:96 - Uncaught exception:
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:98)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:116)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runAMEndpoint(ApplicationMaster.scala:279)
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:473)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:315)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:157)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:625)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:623)
    at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:646)
    at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
    ... 16 more
2017-04-11 07:15:32 INFO  ApplicationMaster:59 - Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout)
2017-04-11 07:15:32 INFO  ShutdownHookManager:59 - Shutdown hook called

--- YARN NodeManager logs at the time of failure

2017-04-11 07:15:18,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:21,735 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:24,742 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:27,749 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:30,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:33,018 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1491909036583_0001_01_000001 is : 10
2017-04-11 07:15:33,019 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1491909036583_0001_01_000001 and exit code: 10
ExitCodeException exitCode=10:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)

--- SparkContext parameters

<!-- Spark Configuration -->
<bean id="sparkInfo" class="SparkInfo">
    <property name="appName" value="framework"></property>
    <property name="master" value="yarn-client"></property>
    <property name="dynamicAllocation" value="false"></property>
    <property name="executorInstances" value="2"></property>
    <property name="executorMemory" value="1g"></property>
    <property name="executorCores" value="4"></property>
    <property name="executorCoresMax" value="2"></property>
    <property name="taskCpus" value="4"></property>
    <property name="executorClassPath" value="/usr/hadoop/hadoop-2.7.3/share/hadoop/yarn/lib/*"></property>
    <property name="yarnJar" value="${framework.hdfsURI}/app/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar"></property>
    <property name="yarnQueue" value="default"></property>
    <property name="memoryFraction" value="0.4"></property>
</bean>

--- spark-defaults.conf

spark.driver.memory              1g
spark.executor.extraJavaOptions  -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
spark.rpc.lookupTimeout          600s

--- yarn-site.xml

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>3096</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3096</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
  </property>
</configuration>

Answer 1:

You can keep increasing spark.network.timeout until the problem disappears, as himanshuIIITian mentioned in a comment.
Timeout exceptions can occur when Spark is under a heavy workload. If executor memory is low, garbage collection may keep the JVM so busy that the workload increases further. Check the logs for an Out Of Memory error. Enable -XX:+PrintGCDetails -XX:+PrintGCTimeStamps in spark.executor.extraJavaOptions and check in the logs whether full GC is invoked many times before a task completes. If that is the case, increase your executorMemory. That should hopefully solve your problem.
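For reference, here is a minimal spark-defaults.conf sketch that combines both suggestions; the 600s timeout and 2g executor memory are illustrative starting points, not tuned recommendations:

# Raise the umbrella network timeout (default 120s, which spark.rpc.lookupTimeout
# falls back to) and turn on GC logging for the executors.
# Values are assumptions for illustration; tune them for your workload.
spark.network.timeout            600s
spark.executor.memory            2g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

The same settings can also be passed per job at submit time, e.g. with --conf spark.network.timeout=600s on the spark-submit command line, which avoids changing the cluster-wide defaults while you experiment.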


