How to import a custom module in a MapReduce job?

时光总嘲笑我的痴心妄想 提交于 2019-12-03 01:11:38

I posted the question to Hadoop user list and finally found the answer. It turns out that Hadoop doesn't really copy files to the location where the command runs, but instead creates symlinks for them. Python, in its turn, can't work with symlinks and thus doesn't recognize lib.py as Python module.

Simple workaround here is to put both main.py and lib.py into the same directory, so that symlink to the directory is placed into MR job working directory, while both files are physically in the same directory. So I did the following:

  1. Put main.py and lib.py into app directory.
  2. In main.py I used lib.py directly, that is, import string is just

    import lib

  3. Uploaded app directory with -files option.

So, final command looks like this:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files app 
       -mapper "app/main.py map" -reducer "app/main.py reduce" 
       -input input -output output 

When Hadoop-Streaming starts the python scripts, your python script's path is where the script file really is. However, hadoop starts them at './', and your lib.py(it's a symlink) is at './', too. So, try to add 'sys.path.append("./")' before you import lib.py like this: import sys sys.path.append('./') import lib

The -files and -archive switches are just shortcuts to Hadoop's distributed cache (DC), a more general mechanism that also allows to upload and automatically unpack archives in the zip, tar and tgz/tar.gz formats. If rather than by a single module your library is implemented by a structured Python package, the latter feature is what you want.

We are directly supporting this in Pydoop since release 1.0.0-rc1, where you can simply build a mypkg.tgz archive and run your program as:

pydoop submit --upload-archive-to-cache mypkg.tgz [...]

The relevant docs are at http://crs4.github.io/pydoop/self_contained.html and here is a full working example (requires wheel): https://github.com/crs4/pydoop/tree/master/examples/self_contained.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!