Is the sparklyr R package able to connect to YARN-managed Hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the Spark
Yes, it can, but there is one catch that the blog posts on this topic rarely mention: configuring the resources.
The key is this: in local mode you do not have to declare resources, but when you run against a YARN cluster you absolutely do have to declare them (executor instances, cores, and memory). It took me a long time to find the article that shed some light on this issue, but once I tried it, it worked.
Here's an (arbitrary) example with the key reference:
library(sparklyr)

# Arbitrary values: size these to what your YARN cluster can actually grant
config <- spark_config()
config$spark.driver.cores <- 32
config$spark.executor.cores <- 32
config$spark.executor.memory <- "40g"
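# Full example: connect to a YARN-managed cluster in client mode
library(sparklyr)

# Point sparklyr at the local Spark install and the Hadoop/YARN client configs
Sys.setenv(SPARK_HOME = "/usr/local/spark")
Sys.setenv(HADOOP_CONF_DIR = "/usr/local/hadoop/etc/hadoop/conf")
Sys.setenv(YARN_CONF_DIR = "/usr/local/hadoop/etc/hadoop/conf")

# Declare executor resources explicitly for the YARN cluster
config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"

sc <- spark_connect(master = "yarn-client", config = config, version = "2.1.0")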
R Bloggers Link to Article
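If you switch between local mode and YARN a lot, one way to keep the distinction straight is a small wrapper that only declares resources when the master is YARN. This is just a sketch: `with_yarn_resources` is a hypothetical helper of my own (not part of sparklyr), the default values are arbitrary, and a plain `list()` stands in for `spark_config()` so it runs without a cluster.

```r
# Hypothetical helper (not part of sparklyr): add executor resources to a
# config list only when the master is YARN; local mode is left untouched.
with_yarn_resources <- function(config, master,
                                instances = 4, cores = 4, memory = "4G") {
  if (grepl("^yarn", master)) {
    config$spark.executor.instances <- instances
    config$spark.executor.cores <- cores
    config$spark.executor.memory <- memory
  }
  config
}

# A plain list stands in for spark_config() so the sketch runs without a cluster.
local_cfg <- with_yarn_resources(list(), "local")
yarn_cfg  <- with_yarn_resources(list(), "yarn-client")

length(local_cfg)               # no resource keys are added in local mode
yarn_cfg$spark.executor.memory  # the executor memory declared for YARN
```

In real use you would presumably pass `spark_config()` instead of `list()` and hand the result to `spark_connect(master = ..., config = ...)`, as in the example above.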