Multiple Inputs with MRJob

失恋的感觉 2020-12-29 14:58

I'm trying to learn to use Yelp's Python API for MapReduce, MRJob. Their simple word counter example makes sense, but I'm curious how one would handle an application invo

5 answers
  •  Happy的楠姐
    2020-12-29 15:29

    The actual answer to your question is that mrjob does not yet support the Hadoop Streaming join pattern, in which the mapper reads the map_input_file environment variable (which exposes the map.input.file property) to determine, from the file's path and/or name, which type of file it is dealing with.
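
For reference, here is a minimal sketch of that pattern as a plain Hadoop Streaming mapper (not mrjob). It relies on Hadoop Streaming exporting job properties as environment variables with dots replaced by underscores; newer Hadoop versions name the property mapreduce.map.input.file, so the sketch checks both. The "users"/"orders" file names, the tags, and the tab-separated layout with the join key in the first column are assumptions made up for illustration.

```python
import os


def source_tag(path):
    """Classify an input path by file name (hypothetical 'users'/'orders' files)."""
    name = os.path.basename(path)
    if name.startswith("users"):
        return "U"
    if name.startswith("orders"):
        return "O"
    return "?"


def map_lines(lines, environ=None):
    """Prefix each record with its join key and a tag naming its source file."""
    if environ is None:
        environ = os.environ
    # Hadoop Streaming exposes map.input.file as map_input_file
    # (mapreduce_map_input_file on newer versions).
    path = environ.get("mapreduce_map_input_file") or environ.get("map_input_file", "")
    tag = source_tag(path)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        # Assumed layout: the join key is the first tab-separated field.
        yield "\t".join([fields[0], tag] + fields[1:])
```

In an actual streaming script you would feed it sys.stdin and print each result; the reducer then sees all records for a key together and can use the tag to tell the two sides apart.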

    You might still be able to pull it off if you can easily tell which type a record belongs to just by reading the data itself, as shown in this article:

    http://allthingshadoop.com/2011/12/16/simple-hadoop-streaming-tutorial-using-joins-and-keys-with-python/

    However, that's not always possible...
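
When the data does give it away, a join can be sketched roughly like this. It is plain Python rather than mrjob code, and the record layouts are assumptions for illustration: user lines have two tab-separated fields (id, name) and order lines have three (id, item, price), so the field count alone identifies the type. run_local is a hypothetical helper that imitates the shuffle step so the logic can be tried without Hadoop.

```python
from itertools import groupby


def mapper(line):
    """Tag a record by its field count (assumed: 2 = user, 3 = order)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 2:
        user_id, name = fields
        return user_id, ("U", name)
    if len(fields) == 3:
        user_id, item, price = fields
        return user_id, ("O", item, price)
    return None  # unrecognized record, dropped


def reducer(user_id, values):
    """Join every order for a user with that user's name."""
    name = None
    orders = []
    for value in values:
        if value[0] == "U":
            name = value[1]
        else:
            orders.append(value[1:])
    for item, price in orders:
        yield user_id, name, item, price


def run_local(lines):
    """Imitate map, sort/shuffle and reduce in memory (illustration only)."""
    pairs = sorted(p for p in map(mapper, lines) if p is not None)
    out = []
    for key, group in groupby(pairs, key=lambda p: p[0]):
        out.extend(reducer(key, (v for _, v in group)))
    return out
```

The same mapper and reducer bodies could be moved into an mrjob job, since nothing here depends on knowing the input file name. The weakness is the one noted above: if both inputs have the same shape, there is nothing to branch on.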

    Otherwise mrjob looks fantastic, and I hope they add support for this in the future. Until then, this is pretty much a deal breaker for me.
