Chaining multiple mapreduce tasks in Hadoop streaming


Question


I am in a scenario where I have two MapReduce jobs. I am more comfortable with Python and plan to use it to write the MapReduce scripts, using Hadoop streaming. Is there a convenient way to chain both jobs in the following form when Hadoop streaming is used?

Map1 -> Reduce1 -> Map2 -> Reduce2

I've heard of a lot of ways to accomplish this in Java, but I need something for Hadoop streaming.


Answer 1:


Here is a great blog post on how to use Cascading with Hadoop streaming: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

The value here is that you can mix Java (Cascading query flows) with your custom streaming operations in the same app. I find this much less brittle than other methods.

Note that the Cascade object in Cascading allows you to chain multiple Flows (per the blog post above, your streaming job would become a MapReduceFlow).

Disclaimer: I'm the author of Cascading




Answer 2:


You can try out Yelp's MRJob to get your job done. It's an open-source MapReduce library that lets you write chained jobs that run on top of Hadoop streaming, on your Hadoop cluster or on EC2. It's pretty elegant and easy to use, and it has a method called steps which you can override to specify the exact chain of mappers and reducers you want your data to go through.
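
For illustration, here is a minimal sketch of such a two-step job using mrjob's MRStep API (assuming a recent mrjob version; the class and method names are made up for this example):

    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class ChainedJob(MRJob):

        def steps(self):
            # Two MRSteps: the first step's reducer output becomes the
            # second step's mapper input, i.e. Map1 -> Reduce1 -> Map2 -> Reduce2.
            return [
                MRStep(mapper=self.mapper_count, reducer=self.reducer_sum),
                MRStep(mapper=self.mapper_invert, reducer=self.reducer_max),
            ]

        def mapper_count(self, _, line):
            # Map1: emit (word, 1) for every word in the input line.
            for word in line.split():
                yield word, 1

        def reducer_sum(self, word, counts):
            # Reduce1: total count per word.
            yield word, sum(counts)

        def mapper_invert(self, word, count):
            # Map2: funnel everything to one key so the final reducer
            # sees all (count, word) pairs at once.
            yield None, (count, word)

        def reducer_max(self, _, pairs):
            # Reduce2: emit the most frequent word.
            yield max(pairs)

    if __name__ == '__main__':
        ChainedJob.run()

You can run this locally with python chained_job.py input.txt while developing, then add -r hadoop to run the same script on the cluster.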

Check out the source at https://github.com/Yelp/mrjob
and the documentation at http://packages.python.org/mrjob/




Answer 3:


Typically, the way I do this with Hadoop streaming and Python is from within the bash script that I create to run the jobs in the first place. I always run from a bash script; that way I can get emails on errors and on success, and I can make the jobs more flexible by passing in parameters from another Ruby or Python script that wraps it, which can work in a larger event-processing system.

So the output of the first command (job) is the input to the next command (job), and these can be variables in your bash script, passed in as arguments from the command line (simple and quick). A sketch of such a driver follows.
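
As an illustration of that approach in Python rather than bash, here is a minimal driver that shells out to Hadoop streaming twice; the streaming jar path, directories, and script names are assumptions to adapt to your install:

    import subprocess

    # Path to the streaming jar varies between Hadoop installs (assumption).
    STREAMING_JAR = "/usr/lib/hadoop/hadoop-streaming.jar"

    def run_streaming(input_path, output_path, mapper, reducer):
        # Launch one streaming job and fail loudly if it exits nonzero.
        subprocess.check_call([
            "hadoop", "jar", STREAMING_JAR,
            "-input", input_path,
            "-output", output_path,
            "-mapper", mapper,
            "-reducer", reducer,
            "-file", mapper,
            "-file", reducer,
        ])

    if __name__ == "__main__":
        # The output directory of job 1 is the input of job 2.
        run_streaming("in/data", "tmp/stage1", "map1.py", "reduce1.py")
        run_streaming("tmp/stage1", "out/final", "map2.py", "reduce2.py")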

You might want to check out Oozie (http://yahoo.github.com/oozie/design.html), a workflow engine for Hadoop that will help with this as well (it supports streaming, so that's not a problem). I did not have it when I started, so I ended up having to build my own thing, but it's a cool and useful system!




Answer 4:


If you are already writing your mapper and reducer in Python, I would consider using Dumbo, where such an operation is straightforward. The sequence of your MapReduce jobs, your mappers, reducers, etc. are all in one Python script that can be run from the command line.
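
As a rough sketch of what that looks like (assuming Dumbo's additer API; the mapper and reducer bodies here are placeholders):

    import dumbo

    def mapper1(key, value):
        # Map1: emit (word, 1) per word.
        for word in value.split():
            yield word, 1

    def reducer1(key, values):
        # Reduce1: sum the counts per word.
        yield key, sum(values)

    def mapper2(key, value):
        # Map2: invert to (count, word) over the first job's output.
        yield value, key

    def reducer2(key, values):
        # Reduce2: pass pairs through.
        for v in values:
            yield key, v

    def runner(job):
        # Each additer() call adds one MapReduce iteration;
        # they run in sequence, each feeding the next.
        job.additer(mapper1, reducer1)
        job.additer(mapper2, reducer2)

    if __name__ == "__main__":
        dumbo.main(runner)

You would then launch the whole chain with a single dumbo start invocation from the command line (see the Dumbo docs for the exact flags for your Hadoop setup).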



Source: https://stackoverflow.com/questions/4626356/chaining-multiple-mapreduce-tasks-in-hadoop-streaming
