Can sparklyr be used with Spark deployed on a YARN-managed Hadoop cluster?

Asked by 自闭症患者 on 2020-12-06 08:00

Is the sparklyr R package able to connect to YARN-managed Hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the Spark

4 Answers
  •  离开以前
    2020-12-06 08:31

    Yes, it can, but there is one catch that everything else written on the topic tends to gloss over, and that is surprisingly hard to find in the blogging literature: resource configuration.

    The key is this: when you run in local mode you do not have to declare resources explicitly, but when you run against a YARN cluster, you absolutely do. It took me a long time to find an article that shed light on this issue, but once I tried it, it worked.
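    For contrast, here is a minimal sketch of the local-mode case, where no explicit resource declarations are needed to get a working connection (the master string is the only thing you have to supply):

    library(sparklyr)
    
    # Local mode: sensible resource defaults are inferred automatically,
    # so spark_connect() works without any declarative config.
    sc_local <- spark_connect(master = "local")
    spark_disconnect(sc_local)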

    Here's an (arbitrary) example with the key reference:

    library(sparklyr)
    
    Sys.setenv(SPARK_HOME = "/usr/local/spark")
    Sys.setenv(HADOOP_CONF_DIR = "/usr/local/hadoop/etc/hadoop/conf")
    Sys.setenv(YARN_CONF_DIR = "/usr/local/hadoop/etc/hadoop/conf")
    
    # Declare the resources explicitly; YARN will not infer them.
    # Driver-side settings (e.g. spark.driver.cores, spark.driver.memory)
    # can be set on the same config object if needed.
    config <- spark_config()
    config$spark.executor.instances <- 4
    config$spark.executor.cores <- 4
    config$spark.executor.memory <- "4G"
    
    sc <- spark_connect(master = "yarn-client", config = config, version = "2.1.0")
    

    R-bloggers link to article
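
    Once connected, a quick smoke test confirms the cluster is actually doing the work. This is a minimal sketch; the table name is arbitrary, and it assumes the connection sc from the snippet above:

    # Copy a small local data frame to the cluster and count its rows there
    mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
    sdf_nrow(mtcars_tbl)   # should return 32
    
    spark_disconnect(sc)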
