Question
Complete error: Databricks Job timed out with error: Lost executor 0 on [IP]. Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

We are running jobs through the Jobs API 2.0 on an Azure Databricks subscription, using the Pools interface to reduce cluster spawn time, with Standard_DS12_v2 as the worker/driver node type.

We have a job (a JAR with a main class) that makes just one SQL procedure call, and this call takes more than 1.2 hours to complete. After exactly 1 hour the worker node gets terminated and the job status becomes Timed Out. We thought this could be because the node looked idle during that hour, so we added a sniffer thread that logs a heartbeat every 10 minutes (sketched below); this has not resolved the issue.
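For context, here is a minimal sketch of what the job's main class does, sniffer thread included; the JDBC URL, credentials, and procedure name are placeholders, not our actual code:

```scala
import java.sql.DriverManager
import java.util.concurrent.{Executors, TimeUnit}

object ProcedureJob {
  def main(args: Array[String]): Unit = {
    // Sniffer thread: logs a heartbeat every 10 minutes so the node
    // never looks idle while the procedure call is blocking.
    val heartbeat = Executors.newSingleThreadScheduledExecutor()
    heartbeat.scheduleAtFixedRate(new Runnable {
      override def run(): Unit =
        println(s"[sniffer] still alive at ${java.time.Instant.now()}")
    }, 0, 10, TimeUnit.MINUTES)

    // Placeholder connection details; the real job makes one stored
    // procedure call that runs for roughly 1.2 hours.
    val conn = DriverManager.getConnection(
      "jdbc:sqlserver://<host>:1433;databaseName=<db>", "<user>", "<password>")
    try {
      val stmt = conn.prepareCall("{call dbo.long_running_proc()}")
      stmt.execute() // blocks well past the 1-hour mark
      stmt.close()
    } finally {
      conn.close()
      heartbeat.shutdownNow()
    }
  }
}
```

Please find the driver logs below: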
20/01/16 10:49:43 INFO StaticConf$: DB_HOME: /databricks
20/01/16 10:49:43 INFO DriverDaemon$: ========== driver starting up ==========
20/01/16 10:49:43 INFO DriverDaemon$: Java: Private Build 1.8.0_232
20/01/16 10:49:43 INFO DriverDaemon$: OS: Linux/amd64 4.15.0-1050-azure
20/01/16 10:49:43 INFO DriverDaemon$: CWD: /databricks/driver
20/01/16 10:49:43 INFO DriverDaemon$: Mem: Max: 17.5G loaded GCs: PS Scavenge, PS MarkSweep
20/01/16 10:49:43 INFO DriverDaemon$: Logging multibyte characters: ✓
20/01/16 10:49:43 INFO DriverDaemon$: 'publicFile' appender in root logger: class com.databricks.logging.RedactionRollingFileAppender
20/01/16 10:49:43 INFO DriverDaemon$: 'org.apache.log4j.Appender' appender in root logger: class com.codahale.metrics.log4j.InstrumentedAppender
20/01/16 10:49:43 INFO DriverDaemon$: 'null' appender in root logger: class com.databricks.logging.RequestTracker
20/01/16 10:49:43 INFO DriverDaemon$: == Modules:
20/01/16 10:49:44 INFO DriverDaemon$: Starting prometheus metrics log export timer
20/01/16 10:49:44 INFO DriverDaemon$: Universe Git Hash: 422793c171cb2855a8f424d226006093e5349873
20/01/16 10:49:44 INFO DriverDaemon$: Spark Git Hash: 0c5791fc51d5c2b434155df16049c9f78e12e8fb
20/01/16 10:49:44 WARN RunHelpers$: Missing tag isolation client: java.util.NoSuchElementException: key not found: TagDefinition(clientType,The client type for a request, used for isolating resources for the request.)
20/01/16 10:49:44 INFO DatabricksILoop$: Creating throwaway interpreter
20/01/16 10:49:44 INFO SparkConfUtils$: Customize spark config according to file /tmp/custom-spark.conf
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.databricks.delta.preview.enabled -> true
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.network.timeout -> 4000
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.driver.host -> 10.30.2.205
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.executor.tempDirectory -> /local_disk0/tmp
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.databricks.secret.envVar.keys.toRedact ->
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.driver.tempDirectory -> /local_disk0/tmp
20/01/16 10:49:44 INFO SparkConfUtils$: new spark config: spark.databricks.secret.sparkConf.keys.toRedact ->
20/01/16 11:49:32 INFO DriverCorral$: Cleaning the wrapper ReplId-20fe8-56e17-17323-1 (currently in status Running(ReplId-20fe8-56e17-17323-1,ExecutionId(job-368-run-1-action-368),RunnableCommandId(6227870104230535817)))
20/01/16 11:49:32 INFO DAGScheduler: Asked to cancel job group 2377484361178493489_6227870104230535817_job-368-run-1-action-368
20/01/16 11:49:32 INFO ScalaDriverLocal: cancelled jobGroup:2377484361178493489_6227870104230535817_job-368-run-1-action-368
20/01/16 11:49:32 INFO ScalaDriverWrapper: Stopping streams for commandId pattern: CommandIdPattern(2377484361178493489,None,Some(job-368-run-1-action-368)).
20/01/16 11:49:35 ERROR TaskSchedulerImpl: Lost executor 0 on 10.30.2.208: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
20/01/16 11:49:35 INFO DAGScheduler: Executor lost: 0 (epoch 1)
20/01/16 11:49:35 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
20/01/16 11:49:35 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, 10.30.2.208, 41985, None)
20/01/16 11:49:35 INFO DBCEventLoggingListener: Rolling event log; numTimesRolledOver = 1
20/01/16 11:49:35 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
20/01/16 11:49:35 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200116104947-0000/0 is now LOST (worker lost)
20/01/16 11:49:35 INFO StandaloneSchedulerBackend: Executor app-20200116104947-0000/0 removed: worker lost
20/01/16 11:49:35 INFO BlockManagerMaster: Removal of executor 0 requested
20/01/16 11:49:35 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
20/01/16 11:49:35 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
20/01/16 11:49:35 INFO DBCEventLoggingListener: Rolled active log file /databricks/driver/eventlogs/5656882020603523684/eventlog to /databricks/driver/eventlogs/5656882020603523684/eventlog-2020-01-16--11-00
20/01/16 11:49:35 INFO StandaloneAppClient$ClientEndpoint: Master removed worker worker-20200116104954-10.30.2.208-38261: 10.30.2.208:38261 got disassociated
20/01/16 11:49:35 INFO DBCEventLoggingListener: Logging events to eventlogs/5656882020603523684/eventlog
20/01/16 11:49:35 INFO StandaloneSchedulerBackend: Worker worker-20200116104954-10.30.2.208-38261 removed: 10.30.2.208:38261 got disassociated
20/01/16 11:49:35 INFO TaskSchedulerImpl: Handle removed worker worker-20200116104954-10.30.2.208-38261: 10.30.2.208:38261 got disassociated
20/01/16 11:49:35 INFO DAGScheduler: Shuffle files lost for worker worker-20200116104954-10.30.2.208-38261 on host 10.30.2.208
On the cluster page we can see the event log, and on the job-run page the run status shows as Timed Out (screenshots omitted).

As we can see in the logs:

- The error happened exactly 1 hour after the job started.
- The job configuration is as below (screenshot omitted; an illustrative equivalent follows).
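Since the configuration screenshot did not carry over, here is a hypothetical Jobs API 2.0 jobs/create payload of the same general shape; every value is a placeholder, not our real configuration. One field worth double-checking is timeout_seconds, since a value of 3600 there would cut a run off after exactly one hour:

```scala
import java.io.OutputStream
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object SubmitJob {
  // Illustrative payload only: the pool id, Spark version, jar path and
  // timeout are placeholders. The pool's node type is Standard_DS12_v2.
  val payload: String =
    """{
      |  "name": "sql-procedure-job",
      |  "new_cluster": {
      |    "instance_pool_id": "<pool-id>",
      |    "spark_version": "6.2.x-scala2.11",
      |    "num_workers": 1
      |  },
      |  "libraries": [{ "jar": "dbfs:/jars/procedure-job.jar" }],
      |  "spark_jar_task": { "main_class_name": "ProcedureJob" },
      |  "timeout_seconds": 3600
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val conn = new URL("https://<workspace>.azuredatabricks.net/api/2.0/jobs/create")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Authorization", "Bearer <token>")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out: OutputStream = conn.getOutputStream
    out.write(payload.getBytes(StandardCharsets.UTF_8))
    out.close()
    println(s"jobs/create returned HTTP ${conn.getResponseCode}")
  }
}
```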
======================== Edit 1
This issue is reproducible and happens every time we run this simple JAR.
Source: https://stackoverflow.com/questions/59820940/databricks-job-timed-out-with-error-lost-executor-0-on-ip-remote-rpc-client