Hadoop and Python: Disable Sorting

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-02 01:39:46
cabad

You should read more on basic MapReduce concepts. Even though the sorting may be unnecessary in some cases, the shuffling part of the "Shuffle & Sort" phase is an intrinsic part of the MapReduce model. The MapReduce framework (Hadoop) needs to group the output of the mappers so that all values for a given key are sent to the same reducer; otherwise the reducer could not actually "reduce" the data. When using streaming, the key and value are, by default, separated by a tab character, and a line with no tab is treated entirely as the key. From your sample code in other SO questions, I can see that you are not producing "key, value" tuples, but rather just single text lines.
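To illustrate the point about key/value output, here is a minimal sketch of a streaming mapper that writes "key<TAB>value" records instead of bare text lines. The input layout is an assumption on my part (first whitespace-delimited token is the key, the rest is the value), and the helper name to_key_value is hypothetical; adapt both to your real data.

```python
import sys


def to_key_value(line, sep="\t"):
    """Turn one raw input line into a 'key<sep>value' record.

    Assumption: the first whitespace-delimited token is the key and the
    remainder of the line is the value. Returns None for blank lines.
    """
    parts = line.strip().split(None, 1)
    if not parts:
        return None
    key = parts[0]
    value = parts[1] if len(parts) > 1 else ""
    return key + sep + value


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop streaming feeds input splits on stdin and reads records from stdout.
    for line in stdin:
        record = to_key_value(line)
        if record is not None:
            stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

With output in this shape, the framework groups and sorts on the part before the tab only, rather than on the whole line.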

EDIT: Added the following answer to the question "How to make it sort numerically (e.g., 9 before 10)?"

Alternative 1: Prepend zeroes to your keys so that they all have the same size. "09" comes before "10".
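A small sketch of the zero-padding idea, assuming the keys are non-negative integers and that you pick a width large enough for your biggest key (the width values and the helper name pad_key are just illustrative):

```python
def pad_key(key, width=10):
    """Left-pad a non-negative integer key with zeroes to a fixed width."""
    return str(int(key)).zfill(width)


keys = ["10", "9", "100"]

# Plain string comparison, which is what the default text sort does.
print(sorted(keys))                         # ['10', '100', '9']

# After padding, string order and numeric order agree.
print(sorted(pad_key(k, 3) for k in keys))  # ['009', '010', '100']
```

You would apply pad_key in the mapper just before writing the key, and strip the zeroes again in the reducer if you need the original form.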

Alternative 2: Use the KeyFieldBasedComparator, as indicated in this SO question.
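For the comparator route, here is a rough sketch that only builds the streaming command line as a Python list. The two -D properties are the Hadoop 2+ names used in the Hadoop Streaming documentation (older releases use mapred.* equivalents); the jar location, paths and script names are placeholders I made up, and the function name is hypothetical. It also assumes the mapper and reducer scripts are already available on the cluster nodes.

```python
import shlex


def build_streaming_command(streaming_jar, input_path, output_path,
                            mapper, reducer):
    """Return the argv list for a streaming job that sorts keys numerically."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic -D options have to come before the streaming options.
        "-D", "mapreduce.job.output.key.comparator.class="
              "org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator",
        # Sort on the first key field, numerically (same syntax as sort -k).
        "-D", "mapreduce.partition.keycomparator.options=-k1,1n",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


if __name__ == "__main__":
    # All arguments below are placeholders, not real locations.
    cmd = build_streaming_command("/path/to/hadoop-streaming.jar",
                                  "/user/me/input", "/user/me/output",
                                  "mapper.py", "reducer.py")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the command rather than running it lets you check it first; you could pass the same list to subprocess.run once it looks right.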

No, as stated here:

If your number of reduce tasks is not 0, the Hadoop framework will sort your results. There is no way around it.
