Temporarily installing R packages on Hadoop nodes for streaming jobs

Submitted by 主宰稳场 on 2020-01-02 08:42:17

Question


I have access to a Hadoop cluster that has base R (2.14.1) on every node but no additional packages installed. I've been writing mapper and reducer streaming scripts in base R to get around the lack of packages. However, I've come to a point where I need to use certain packages, mainly rjson, in my scripts.

I don't have admin privileges on the cluster, and the user accounts are fairly restricted. Having the cluster admins install the package on every node is not an option (for now), and the cluster has no external internet access.

I've uploaded the rjson_0.2.8.tar.gz source file to my gateway node. Is it possible to install R packages temporarily by adding install.packages("rjson_0.2.8.tar.gz", repos = NULL, lib = "/tmp") or something along those lines to the script, so that the package is installed when the script starts, and to pass the source via the -cacheArchive parameter of the streaming job? I'd like the package to be installed in a temporary location so that it disappears when the job is complete.

Is this even possible?

I know I'll get some "use python" answers since it's for processing JSON, which is an option, but the question is for any package. :)


Answer 1:


I am the author of rmr (project RHadoop). We are experimenting with a pretty radical approach to sidestep the installation issue. We package the whole R distribution, packages and everything, in a jar, using the streaming features as you describe but with one level of indirection. The R distribution is loaded to a user hdfs directory, not a tmp directory. Streaming then moves it to each node. The job itself moves it to its final destination whenever it's not present already. We did so because the whole distro is not tiny and we wanted to take advantage of the caching features of streaming, and because components of R are not relocatable. So you would rebuild the jar and move it to hdfs whenever you update something or add a package. The rest is automatic and happens only when needed (hdfs->nodes->final location). I even got some coaching from the Hortonworks guys to do it right.

We have a proof of concept in the branch 0-install, but it works only for ubuntu/EC2, and apparently I managed to hard-code some paths that I shouldn't have and I am making a number of other assumptions. So this is only for developers willing to chip in, but the main ingredients are all in place. Of course this is conditional on you writing your jobs with rmr, which is a separate decision, or you could just take a look at the code and reproduce the approach for your purposes. But I'd rather have this solved once and for all for everybody.

The script preparing the jar is this: https://github.com/RevolutionAnalytics/RHadoop/blob/0-install/rmr/pkg/tools/0-install/setup-jar and the rest of the action is in rmr:::rhstream




Answer 2:


You should be able to do as you suggest with the -cacheArchive argument - but note that this has been deprecated and you should be using -archives instead.

Another point to note: using -archives rather than -files means that your tar.gz file will be unpacked by the task tracker (rather than you having to unpack it manually).

Either way, the file or the unpacked files will be available in the current working directory when your code executes, and from there you'll be able to install and load the packages using the mechanisms available in R (I've never used R, so you're on your own from here).
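On the R side, a script has to work out whether it was handed the tarball itself (-files) or an unpacked directory (-archives). A minimal sketch follows. It assumes the unpacked archive shows up in the working directory under the archive's name, with the package source one level down, which should be checked on the actual cluster. The function name find_pkg_source is hypothetical.

```r
# Resolve what to pass to install.packages(): either the tarball itself
# (shipped with -files) or the package source directory inside an archive
# that Hadoop has already unpacked (shipped with -archives).
# ASSUMPTION: the unpacked archive is a directory in the working directory
# and the package source sits at its top level or one level below.
find_pkg_source <- function(path) {
  if (!file.exists(path)) {
    stop("not found in working directory: ", path)
  }
  if (!isTRUE(file.info(path)$isdir)) {
    return(path)  # plain tarball, install.packages() can take it directly
  }
  # Unpacked archive: pick the directory that holds a DESCRIPTION file.
  candidates <- c(path, list.files(path, full.names = TRUE))
  has_desc <- file.exists(file.path(candidates, "DESCRIPTION"))
  if (!any(has_desc)) {
    stop("no DESCRIPTION file under: ", path)
  }
  candidates[has_desc][1]
}
```

The result could then be used as install.packages(find_pkg_source("rjson_0.2.8.tar.gz"), repos = NULL, type = "source", lib = some_writable_dir), where some_writable_dir is a placeholder for a library directory the task is allowed to write to.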




Answer 3:


You can create a temporary directory (e.g., using tempfile from R, which returns a unique name that you then create with dir.create, or mkdtemp from Python). Make sure the directory name is unique; otherwise R will report an error when multiple mappers simultaneously install packages to the same location. This temporary directory can be used as the library location for install.packages. The directory lives in the location defined by the mapred.child.tmp property. Under the default setup, it is removed after the corresponding mapper completes. You can also set mapred.child.tmp to a particular location (e.g., -D mapred.child.tmp=/tmp/), but in that case Hadoop may not delete the temporary directory.
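A minimal sketch of the start of such a mapper script, assuming the source tarball has been shipped into the task's working directory. The helper name make_private_lib and the "rlib_" prefix are made up for illustration; tempfile, dir.create, .libPaths and install.packages are base R.

```r
#!/usr/bin/env Rscript
# Create a uniquely named library directory for this task and put it first
# on the library search path. tempfile() only returns a unique path under
# R's session temporary directory; dir.create() actually creates it.
make_private_lib <- function(prefix = "rlib_") {
  lib <- tempfile(pattern = prefix)
  dir.create(lib, recursive = TRUE)
  .libPaths(c(lib, .libPaths()))
  lib
}

lib <- make_private_lib()

# In a real mapper or reducer the package would then be installed and loaded
# from the private library (tarball name taken from the question):
#   install.packages("rjson_0.2.8.tar.gz", repos = NULL,
#                    type = "source", lib = lib)
#   library(rjson, lib.loc = lib)
```

Because every task gets its own directory, concurrent mappers on the same node do not write to the same library location. The cost is that the package is compiled and installed once per task.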



Source: https://stackoverflow.com/questions/11143406/temporarily-installing-r-packages-on-hadoop-nodes-for-streaming-jobs
