How to get filename when running mapreduce job on EC2?

Posted by 拥有回忆 on 2019-12-04 19:23:45

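In the typical WordCount example, the name of the file the mapper is processing is ignored, since the job output contains the consolidated word count across all input files rather than a count per file. To get the word count at the file level, the input file name has to be used. Hadoop Streaming passes job configuration properties to the task as environment variables, with the dots in the property name replaced by underscores. A Python mapper can therefore read the map.input.file property as os.environ["map_input_file"] (os.environ["mapreduce_map_input_file"] on newer Hadoop releases, where the property was renamed). The Hadoop documentation lists the task execution environment variables.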
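As a minimal sketch of that lookup, the helper below tries both variable names and returns only the base name of the path. The function name current_input_file, the "unknown" fallback, and the choice to strip the directory part are assumptions made for illustration, not part of Hadoop.

```python
import os

# Hadoop Streaming exports job properties with non-alphanumeric characters
# replaced by underscores. The property was renamed between Hadoop versions,
# so the newer name is tried first (assumption: one of the two is set).
CANDIDATE_VARS = ("mapreduce_map_input_file", "map_input_file")


def current_input_file(environ=os.environ, default="unknown"):
    """Return the base name of the file this map task is reading."""
    for name in CANDIDATE_VARS:
        path = environ.get(name)
        if path:
            # Works for plain paths and for URIs such as hdfs://... or s3://...
            return os.path.basename(path)
    return default
```

Passing the environment as a parameter keeps the helper easy to try outside a cluster, for example current_input_file({"map_input_file": "hdfs://nn/data/input.txt"}).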

Instead of emitting just the key/value pair <Hello, 1>, the mapper should also include the name of the input file being processed. For example, the map can emit <input.txt, <Hello, 1>>, where input.txt is the key and <Hello, 1> is the value.
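A streaming mapper along these lines might look like the sketch below. It assumes tab-separated output, where the first field (the file name) is treated as the key and the rest as the value, and it assumes the file name has already been read from the task environment. The names map_lines and run_mapper are hypothetical.

```python
import sys


def map_lines(lines, filename):
    """Yield 'filename<TAB>word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{filename}\t{word}\t1"


def run_mapper(filename, stdin=sys.stdin, stdout=sys.stdout):
    """Read text lines from stdin and write one record per word to stdout."""
    for record in map_lines(stdin, filename):
        stdout.write(record + "\n")
```

In a real mapper script, run_mapper would be called once at the bottom of the file with the name taken from the environment.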

Because the file name is the key, all the word counts for a particular file are processed by a single reducer. The reducer must then aggregate the count of each word for that file.
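The reducer side could be sketched as follows, assuming the input records have the form filename<TAB>word<TAB>count and arrive sorted by file name, as Hadoop Streaming does for keys. The function names are hypothetical, and holding one Counter per file in memory is a simplification that assumes each file's vocabulary fits in memory.

```python
import sys
from collections import Counter
from itertools import groupby


def parse(line):
    """Split 'filename<TAB>word<TAB>count' into typed fields."""
    filename, word, count = line.rstrip("\n").split("\t")
    return filename, word, int(count)


def reduce_lines(lines):
    """Yield 'filename<TAB>word<TAB>total' for each word of each file."""
    records = (parse(line) for line in lines if line.strip())
    # Records for one file are contiguous because the input is sorted by key.
    for filename, group in groupby(records, key=lambda r: r[0]):
        totals = Counter()
        for _, word, count in group:
            totals[word] += count
        for word in sorted(totals):
            yield f"{filename}\t{word}\t{totals[word]}"


def run_reducer(stdin=sys.stdin, stdout=sys.stdout):
    """Aggregate the records on stdin and write the totals to stdout."""
    for record in reduce_lines(stdin):
        stdout.write(record + "\n")
```

Since the output records have the same shape as the input records, the same aggregation logic can also be applied to partial results.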

As usual, a Combiner helps reduce the network traffic between the mappers and the reducers, which also lets the job complete faster.

Check Data-Intensive Text Processing with MapReduce for more algorithms on text processing.
