I have a Spark cluster with 10 nodes, and I'm getting this exception after using the Spark Context for the first time:
14/11/20 11:15:13 ERROR UserGroupInformation
We had a similar problem, which was quite hard to debug and isolate. Long story short: Spark uses Akka, which is very picky about FQDN hostnames resolving to IP addresses. Specifying the IP address everywhere is not enough. The answer here helped us isolate the problem.
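A quick sanity check (assuming standard Linux tooling) is to confirm on each node that the FQDN resolves, and resolves to the address you expect:

hostname -f                    # the FQDN the node reports for itself
getent hosts "$(hostname -f)"  # the address that name actually resolves to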
A useful test is to run netcat -l <port> on the master and nc -vz <host> <port> on a worker to check connectivity. Run the test with both the IP address and the FQDN. You can get the name Spark is using from the WARN message in the log snippet below; a sketch of the test follows the snippet. For us it was host032s4.staging.companynameremoved.info. The IP address test passed, but the FQDN test failed because our DNS was not set up correctly.
INFO 2015-07-24 10:33:45 Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@10.40.246.168:35455]
INFO 2015-07-24 10:33:45 Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher@10.40.246.168:35455]
INFO 2015-07-24 10:33:45 org.apache.spark.util.Utils: Successfully started service 'driverPropsFetcher' on port 35455.
WARN 2015-07-24 10:33:45 Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@host032s4.staging.companynameremoved.info:50855]. Address is now gated for 60000 ms, all messages to this address will be delivered to dead letters.
ERROR 2015-07-24 10:34:15 org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:skumar cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
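For reference, this is roughly what the test looked like, with the port and hosts taken from the snippet above purely as illustration:

netcat -l 35455                                       # on the master: listen on the port under test
nc -vz 10.40.246.168 35455                            # on a worker: test the raw IP address
nc -vz host032s4.staging.companynameremoved.info 35455   # on a worker: test the FQDN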
Another thing we had to do was specify the spark.driver.host and spark.driver.port properties in the spark-submit script, because our machines had two IP addresses and the FQDN resolved to the wrong one.
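A minimal sketch of the relevant part of such a submit command (the host and port are placeholders for your own values):

spark-submit \
  --conf spark.driver.host=<reachable-driver-ip> \
  --conf spark.driver.port=<open-port> \
  ...

spark.driver.host should be the address the workers can actually reach, which is what mattered on our dual-homed machines.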
Make sure your network and DNS entries are correct!
The firewall was misconfigured and, in some instances, didn't allow the slaves to connect to the cluster. This produced the timeout, since the slaves couldn't reach the master. If you are facing this timeout, check your firewall configuration.
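As one illustration (assuming a Linux master running Spark standalone on its default port 7077; your environment may use security groups or firewalld instead of iptables):

nc -vz <master-host> 7077                           # from a slave: check the master port is reachable
iptables -A INPUT -p tcp --dport 7077 -j ACCEPT     # on the master: allow it if blocked

Note that the driver and executors also use additional (by default random) ports, so opening 7077 alone may not be enough.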
I had a similar problem and managed to get around it by using cluster deploy mode when submitting the application to Spark.
(Even allowing all incoming traffic to both my master and my single slave didn't let me use client deploy mode. Before changing them, I had the default security group (AWS firewall) settings set up by the Spark EC2 scripts from Spark 1.2.0.)
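A sketch of such a submission (the master URL, class, and jar are placeholders); with --deploy-mode cluster the driver runs on a worker inside the cluster, so nothing needs to connect back through the firewall to the machine you submit from:

spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar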