How to use spark_apply_bundle

Submitted by 这一生的挚爱 on 2019-12-11 00:59:10

Question


I am trying to use spark_apply_bundle to limit the number of packages/data transferred to the worker nodes on a YARN-managed cluster. As mentioned here, I must pass the path of the tarball to spark_apply() as the packages argument, and I must also make it available via "sparklyr.shell.files" in the Spark config.
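
For reference, the tarball itself is produced by sparklyr's spark_apply_bundle(). A minimal sketch, assuming the default arguments (the name of the generated .tar file will vary):

library(sparklyr)

# Bundle the R packages installed in the local libraries into a tarball;
# the function returns the path of the .tar file it writes to the working directory.
bundle <- spark_apply_bundle()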

My questions are:

  • Can the path to the tarball be relative to the project's working directory? If not, should it be stored on HDFS or somewhere else?
  • What should be passed to "sparklyr.shell.files"? Is it a duplicate of the path passed to spark_apply()?

Currently my unsuccessful script looks something like this:

# Absolute path to the first .tar bundle found in the working directory
bundle <- file.path(getwd(), list.files(pattern = "\\.tar$")[1])

...

# Ship the bundle alongside the job via spark-submit's file distribution
config$sparklyr.shell.files <- bundle
sc <- spark_connect(master = "yarn-client", config = config)

...

# Use the pre-built bundle rather than having spark_apply() create one
spark_apply(sdf, f, packages = bundle)

Answer 1:


The Spark job succeeded once the tarball was copied to HDFS. Other methods seem plausible (e.g. copying the file to each worker node), but this appears to be the easiest solution.

The updated script looks as follows:

# Absolute path to the first .tar bundle found in the working directory
bundle <- file.path(getwd(), list.files(pattern = "\\.tar$")[1])

...

hdfs_path <- "hdfs://nn.example.com/some/directory/"
hdfs_bundle <- paste0(hdfs_path, basename(bundle))
# Upload the tarball to HDFS so that YARN can distribute it to the containers
system(paste("hdfs dfs -put", bundle, hdfs_path))
config$sparklyr.shell.files <- hdfs_bundle
sc <- spark_connect(master = "yarn-client", config = config)

...

spark_apply(sdf, f, packages = bundle)
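
Note that "sparklyr.shell.files" is handed to spark-submit as its --files option, so YARN stages the HDFS copy of the tarball into each container's working directory. That is presumably why spark_apply() can still be given the original local bundle path here: the file appears to be resolved by its base name on the workers.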


Source: https://stackoverflow.com/questions/49717181/how-to-use-spark-apply-bundle
