Periodic hadoop jobs running (best practice)

做~自己de王妃 提交于 2019-12-01 07:34:21

问题


Customers able to upload urls in any time to database and application should processes urls as soon as possible. So i need periodic hadoop jobs running or run hadoop job automatically from other application(any script identifies new links were added, generates data for hadoop job and runs job). For PHP or Python script, i could set up cronjob, but what is best practice for periodic hadoop jobs running (prepare data for hadoop, upload data, run hadoop job and move data back to database?


回答1:


Take a look at Oozie, the new workflow system from Y!, which can run jobs based on different triggers. A good overflow is presented by Alejandro here: http://www.slideshare.net/ydn/5-oozie-hadoopsummit2010




回答2:


If you want urls to be processed as soon as possible, you'll have them processed each at a time. My recommendation is to wait for some number of links (or MB of links, or for example 10 min, every day).
And batch process them (I do my processing daily, but that jobs takes few hours)



来源:https://stackoverflow.com/questions/3139780/periodic-hadoop-jobs-running-best-practice

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!