Should I pre-install CRAN R packages on worker nodes when using SparkR?


Question


I want to use R packages from CRAN, such as forecast, with SparkR, and I have run into the following two problems.

  1. Should I pre-install all those packages on the worker nodes? When I read this file in the Spark source code, it seems that Spark will automatically zip packages and distribute them to the workers via --jars or --packages. What should I do to make the dependencies available on the workers?

  2. Suppose I need to use functions provided by forecast in a map transformation. How should I import the package? Do I need to do something like the following, importing the package inside the map function, and will that cause it to be imported multiple times: SparkR:::map(rdd, function(x){ library(forecast); then do other stuff })

Update:

After reading more of the source code, it seems that I can use includePackage to include packages on the worker nodes, according to this file. So now the question becomes: is it true that I have to pre-install the packages on the nodes manually? And if that's true, what is the use case for the --jars and --packages options described in question 1? If that's wrong, how do I use --jars and --packages to install the packages?
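
For reference, this is roughly the pattern I have in mind, assuming forecast is already installed on every node (sc, rdd and the model call are only placeholders):

sc <- sparkR.init()
SparkR:::includePackage(sc, forecast)   # asks the workers to attach forecast
fits <- SparkR:::map(rdd, function(x) {
  # this only works if forecast is actually installed on the node
  auto.arima(x)
})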


Answer 1:


It is boring to repeat this, but you shouldn't use the internal RDD API in the first place. It was removed in the first official SparkR release and it is simply not suitable for general usage.

Until the new low-level API* is ready (see for example SPARK-12922, SPARK-12919, SPARK-12792) I wouldn't consider Spark as a platform for running plain R code. Even when that changes, adding native (Java / Scala) code with R wrappers can be a better choice.

That being said, let's start with your questions:

  1. RPackageUtils is designed to handle packages created with Spark Packages in mind. It doesn't handle standard R libraries.
  2. Yes, you need the packages to be installed on every node. From the includePackage docstring:

    The package is assumed to be installed on every node in the Spark cluster.


* If you use Spark 2.0+, you can use the dapply, gapply and lapply functions.
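
For illustration, a minimal sketch of that route, assuming a running SparkSession and forecast already installed on every worker (the data and horizons are arbitrary; the distributed lapply is exposed as spark.lapply in released SparkR):

library(SparkR)
sparkR.session()

horizons <- list(1, 6, 12)
point_forecasts <- spark.lapply(horizons, function(h) {
  # runs in a worker-side R process; forecast must already be installed there
  library(forecast)
  fit <- auto.arima(AirPassengers)           # AirPassengers ships with base R
  as.numeric(forecast(fit, h = h)$mean)      # h-step-ahead point forecasts
})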




Answer 2:


Adding libraries works with Spark 2.0+. For example, here I am adding the forecast package on all nodes of the cluster. The code works with Spark 2.0+ and in the Databricks environment.

schema <- structType(structField("out", "string"))
out <- gapply(
  df,
  c("p", "q"),
  function(key, x) {
    # install and attach forecast on the worker if it is not already available
    if (!all(c("forecast") %in% (.packages()))) {
      if (!require("forecast")) {
        install.packages("forecast", repos = "http://cran.us.r-project.org", INSTALL_opts = c("--no-lock"))
        library(forecast)
      }
    }
    # use forecast here, then return a data frame matching the schema
    data.frame(out = x$column, stringsAsFactors = FALSE)
  },
  schema)
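
The result out is an ordinary SparkDataFrame, so the grouped output can be brought back to the driver in the usual way, for example:

local_out <- collect(out)   # local data.frame with the single column defined in schema
head(local_out)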



Answer 3:


A better choice is to ship your local R packages via the spark-submit --archives option, which means you do not need to install the R packages on each worker, and you avoid the time-consuming install-and-compile step while SparkR::dapply is running. For example:

Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client --num-executors 40 --executor-cores 10 --executor-memory 8G --driver-memory 512M --jars /usr/lib/hadoop/lib/hadoop-lzo-0.4.15-cdh5.11.1.jar --files /etc/hive/conf/hive-site.xml --archives /your_R_packages/3.5.zip --files xgboost.model sparkr-shell")

When calling the SparkR::dapply function, have it call .libPaths("./3.5.zip/3.5") first. Also note that the R version on the servers must be the same as the R version used to build your zip file.
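
As a rough sketch of the dapply side, assuming the archive was shipped with --archives as above (df, the y column and the schema are placeholders):

schema <- structType(structField("yhat", "double"))
out <- dapply(df, function(x) {
  .libPaths(c("./3.5.zip/3.5", .libPaths()))   # point the worker at the shipped libraries
  library(forecast)
  # x is one partition as a local data.frame; return a data.frame matching the schema
  data.frame(yhat = as.numeric(forecast(auto.arima(x$y), h = 1)$mean))
}, schema)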



Source: https://stackoverflow.com/questions/36001256/should-i-pre-install-cran-r-packages-on-worker-nodes-when-using-sparkr
