Running multiple mapreduce jobs with oozie?

情到浓时终转凉″ submitted on 2019-12-08 13:16:49

Question


As part of a workaround, I want to use two MapReduce jobs (instead of one) that run in sequence to produce the desired effect.

The map function in each job simply emits each key/value pair without processing it. The reduce functions in the two jobs are different, as they perform different kinds of processing.

I stumbled upon Oozie, and it seems to write directly to the input stream of the subsequent job (or does it?). This would be great, since the intermediate data is large (the I/O would become a bottleneck).

How can I achieve this with Oozie (two MR jobs in one workflow)?

I did go through the resource below, but it simply runs a single job as a workflow: https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook

Help appreciated.

Cheers


Answer 1:


Oozie is a system for describing a workflow of jobs, where that workflow may contain a set of MapReduce jobs, Pig scripts, FS operations, etc., and it supports forking and joining of the data flow.

It doesn't, however, allow you to stream the output of one MR job directly as the input to another. The map-reduce action in Oozie still requires an output format of some type, typically a file-based one, so the output of job 1 will still be serialized to HDFS before being processed by job 2.

The Oozie documentation has an example with multiple MR jobs, including a fork:

http://oozie.apache.org/docs/3.2.0-incubating/WorkflowFunctionalSpec.html#Appendix_B_Workflow_Examples
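For reference, here is a minimal sketch of what such a workflow.xml could look like: two map-reduce actions chained via the <ok to="..."/> transition, where the second action reads the first action's output directory from HDFS. The action names, the ${inputDir}/${intermediateDir}/${outputDir} parameters, and the com.example.* mapper/reducer classes are placeholders for this example (old-API classes configured through mapred.* properties), not something taken from the original question.

```xml
<workflow-app name="two-mr-jobs-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="first-mr"/>

    <!-- Job 1: identity map, first kind of reduce; writes to the intermediate dir -->
    <action name="first-mr">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}${intermediateDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.example.IdentityStageOneMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>com.example.StageOneReducer</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${intermediateDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="second-mr"/>
        <error to="fail"/>
    </action>

    <!-- Job 2: reads the intermediate dir from HDFS, second kind of reduce -->
    <action name="second-mr">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.example.IdentityStageTwoMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>com.example.StageTwoReducer</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${intermediateDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>MR job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The intermediate directory here is exactly the HDFS serialization point described above; the prepare/delete blocks only make the workflow re-runnable.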




Answer 2:


There is a way: look at the ChainMapper class in Hadoop. It allows you to feed the output of one mapper directly into the input of the next mapper without hitting disk.
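As an illustration (not from the original answer), a minimal driver using the new-API org.apache.hadoop.mapreduce.lib.chain.ChainMapper and ChainReducer might look like the sketch below; the mapper/reducer classes and the job name are made up for the example. The output of TokenizeMapper is handed to LowerCaseMapper in memory, and only the final output of the single job is written to HDFS. Note that this chains mappers within one job; it does not give you two separate reduce phases.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {

  // First mapper: splits each line into words and emits (word, 1).
  public static class TokenizeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String word : value.toString().split("\\s+")) {
        if (!word.isEmpty()) {
          ctx.write(new Text(word), ONE);
        }
      }
    }
  }

  // Second mapper: lower-cases the key; receives the first mapper's output in memory.
  public static class LowerCaseMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void map(Text key, IntWritable value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(key.toString().toLowerCase()), value);
    }
  }

  // Reducer: sums the counts per word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "chained mappers example");
    job.setJarByClass(ChainDriver.class);

    // Chain the two mappers: TokenizeMapper's output feeds LowerCaseMapper without an HDFS write.
    ChainMapper.addMapper(job, TokenizeMapper.class,
        LongWritable.class, Text.class, Text.class, IntWritable.class, new Configuration(false));
    ChainMapper.addMapper(job, LowerCaseMapper.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class, new Configuration(false));

    // Single reducer at the end of the chain.
    ChainReducer.setReducer(job, SumReducer.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class, new Configuration(false));

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```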



Source: https://stackoverflow.com/questions/13369260/running-multiple-mapreduce-jobs-with-oozie
