We've had a small spark cluster running for a month now that's been successfully executing jobs or let me start up a spark-shell to the cluster.
It doesn't matter if I submit a job to the cluster or connect to it using the shell, the error is always the same.
root@~]$ $SPARK_HOME/bin/spark-shell Spark assembly has been built with Hive, including Datanucleus jars on classpath 14/11/10 20:43:01 INFO spark.SecurityManager: Changing view acls to: root, 14/11/10 20:43:01 INFO spark.SecurityManager: Changing modify acls to: root, 14/11/10 20:43:01 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, ); users with modify permissions: Set(root, ) 14/11/10 20:43:01 INFO spark.HttpServer: Starting HTTP Server 14/11/10 20:43:01 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/11/10 20:43:01 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:60223 14/11/10 20:43:01 INFO util.Utils: Successfully started service 'HTTP class server' on port 60223. Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_65) Type in expressions to have them evaluated. Type :help for more information. 14/11/10 20:43:05 INFO spark.SecurityManager: Changing view acls to: root, 14/11/10 20:43:05 INFO spark.SecurityManager: Changing modify acls to: root, 14/11/10 20:43:05 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, ); users with modify permissions: Set(root, ) 14/11/10 20:43:05 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/11/10 20:43:05 INFO Remoting: Starting remoting 14/11/10 20:43:05 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@ip-10-237-182-163.ec2.internal:41369] 14/11/10 20:43:05 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@ip-10-237-182-163.ec2.internal:41369] 14/11/10 20:43:05 INFO util.Utils: Successfully started service 'sparkDriver' on port 41369. 14/11/10 20:43:05 INFO spark.SparkEnv: Registering MapOutputTracker 14/11/10 20:43:05 INFO spark.SparkEnv: Registering BlockManagerMaster 14/11/10 20:43:05 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/spark-local-20141110204305-a4f0 14/11/10 20:43:05 INFO storage.DiskBlockManager: Created local directory at /mnt2/spark/spark-local-20141110204305-991c 14/11/10 20:43:05 INFO util.Utils: Successfully started service 'Connection manager for block manager' on port 56708. 14/11/10 20:43:05 INFO network.ConnectionManager: Bound socket to port 56708 with id = ConnectionManagerId(ip-10-237-182-163.ec2.internal,56708) 14/11/10 20:43:05 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB 14/11/10 20:43:05 INFO storage.BlockManagerMaster: Trying to register BlockManager 14/11/10 20:43:05 INFO storage.BlockManagerMasterActor: Registering block manager ip-10-237-182-163.ec2.internal:56708 with 265.4 MB RAM 14/11/10 20:43:05 INFO storage.BlockManagerMaster: Registered BlockManager 14/11/10 20:43:05 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-fa8cd9e8-5a4a-40a4-bc76-c2215886873e 14/11/10 20:43:05 INFO spark.HttpServer: Starting HTTP Server 14/11/10 20:43:05 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/11/10 20:43:05 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:36394 14/11/10 20:43:05 INFO util.Utils: Successfully started service 'HTTP file server' on port 36394. 14/11/10 20:43:06 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/11/10 20:43:06 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 14/11/10 20:43:06 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 14/11/10 20:43:06 INFO ui.SparkUI: Started SparkUI at http://ec2-54-91-220-90.compute-1.amazonaws.com:4040 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Connecting to master spark://ec2-54-91-220-90.compute-1.amazonaws.com:7077... 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 14/11/10 20:43:06 INFO repl.SparkILoop: Created spark context.. 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20141110204306-0389 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/0 on worker-20140929210658-ip-10-225-160-49.ec2.internal-60693 (ip-10-225-160-49.ec2.internal:60693) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/0 on hostPort ip-10-225-160-49.ec2.internal:60693 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/1 on worker-20140929210658-ip-10-147-28-32.ec2.internal-60731 (ip-10-147-28-32.ec2.internal:60731) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/1 on hostPort ip-10-147-28-32.ec2.internal:60731 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/2 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/2 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/2 is now RUNNING 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/1 is now RUNNING 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/2 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/2) 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/2 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/2 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/3 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/3 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/0 is now RUNNING 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/3 is now RUNNING Spark context available as sc. scala> 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/3 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/3) 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/3 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/3 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/4 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/4 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/4 is now RUNNING 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/4 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/4) 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/4 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/4 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/5 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/5 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/5 is now RUNNING 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/5 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/5) 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/5 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/5 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/6 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/6 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/6 is now RUNNING 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/6 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/6) 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/6 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/6 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/7 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/7 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/7 is now RUNNING 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/7 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/7) 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/7 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/7 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/8 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/8 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/8 is now RUNNING 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/8 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/8) 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/8 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/8 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/9 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/9 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/9 is now RUNNING 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/9 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/9) 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/9 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/9 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/10 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/10 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/10 is now RUNNING 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/10 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/10) 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/10 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/10 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor added: app-20141110204306-0389/11 on worker-20140929210657-ip-10-69-165-231.ec2.internal-47794 (ip-10-69-165-231.ec2.internal:47794) with 4 cores 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141110204306-0389/11 on hostPort ip-10-69-165-231.ec2.internal:47794 with 4 cores, 12.4 GB RAM 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/11 is now RUNNING 14/11/10 20:43:06 INFO client.AppClient$ClientActor: Executor updated: app-20141110204306-0389/11 is now FAILED (java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/11) 14/11/10 20:43:06 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141110204306-0389/11 removed: java.io.IOException: Failed to create directory /root/spark/work/app-20141110204306-0389/11 14/11/10 20:43:06 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED 14/11/10 20:43:06 ERROR scheduler.TaskSchedulerImpl: Exiting due to error from cluster scheduler: Master removed our application: FAILED`
If however I only run it on local mode, it connects and runs fine.
I'm leaning towards this being a permission error of some sort but we haven't touched anything this whole time it's been working.
EDIT
After a little more digging I discovered that the worker nodes were out of disk space. In the work folder it turns out it's storing the jar that is copied over as well as a stdout and stderr file for that job. Is there anyway to have it delete these when done with it as we have logging setup for the jobs to go to S3.