Two equal combine keys do not get to the same reducer

Submitted by 烈酒焚心 on 2019-12-06 11:39:14

This is most probably because your combiner is running in both the map and reduce phases (a little-known 'feature').

Basically you are amending the key in the combiner, which may or may not run as map outputs are merged together in the reducer. After the combiner is run (reduce side), the keys are fed through the grouping comparator to determine which values back the Iterable passed to the reduce method. (I'm skirting around the streaming aspect of the reduce phase here: the Iterable is not backed by a set or list of values; rather, calls to iterator().hasNext() return true if the grouping comparator determines that the current key and the last key are the same.)

You can try to detect which side the combiner is currently running on (map or reduce) by inspecting the Context (there is a Context.getTaskAttemptID().isMap() method, but I have some memory of this being problematic too, and there might even be a JIRA ticket about this somewhere).

Bottom line: don't amend the key in the combiner unless you can find a way to bypass this behaviour when the combiner is running reduce side.

EDIT: Investigating @Amar's comment, I put together some code (pastebin link) which adds in some verbose comparators, combiners, reducers, etc. If you run a single-map job, then no combiner runs in the reduce phase, and the map output is not sorted again because it is assumed to be sorted already.

It is assumed to be sorted because it is sorted before being sent into the combiner class, and it is assumed that the keys will come out untouched, hence still sorted. Remember, a Combiner is meant to combine values for a given key.
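To illustrate that assumption, here is a minimal plain-Java sketch of what a well-behaved combiner does: it folds the values for each key and re-emits the key unchanged, so sorted input stays sorted. It deliberately uses no Hadoop classes; the class name KeyPreservingCombine, the method combine, and the choice of summing integers are all hypothetical and only for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a combiner: plain Java, no Hadoop types.
class KeyPreservingCombine {

    // Sums the values of adjacent equal keys and emits each key unchanged.
    // Assumes the input is already sorted by key, as map output is.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> sortedInput) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String currentKey = null;
        int sum = 0;
        for (Map.Entry<String, Integer> e : sortedInput) {
            if (currentKey != null && !currentKey.equals(e.getKey())) {
                // Key changed: flush the finished group under its original key.
                out.add(new SimpleEntry<String, Integer>(currentKey, sum));
                sum = 0;
            }
            currentKey = e.getKey();
            sum += e.getValue();
        }
        if (currentKey != null) {
            out.add(new SimpleEntry<String, Integer>(currentKey, sum));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> in = new ArrayList<>();
        in.add(new SimpleEntry<String, Integer>("KeyOne", 1));
        in.add(new SimpleEntry<String, Integer>("KeyOne", 2));
        in.add(new SimpleEntry<String, Integer>("KeyTwo", 5));
        System.out.println(combine(in)); // [KeyOne=3, KeyTwo=5]
    }
}
```

Because the output keys are the input keys, any downstream code that relies on the sort order still works. A combiner that rewrites keys breaks exactly this guarantee.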

So with a single map and the given combiner, the reducer sees the keys in KeyOne, KeyTwo, KeyOne, KeyTwo, KeyOne order. The grouping comparator sees a transition between each of them, and hence you get 6 calls to the reduce function.
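The transition counting can be mimicked outside Hadoop. The following is a rough sketch, assuming the reducer starts a new group whenever the grouping comparator reports that the current key differs from the previous one. The names GroupingDemo and countReduceCalls, and the six-key alternating sequence, are illustrative assumptions and are not taken from the pastebin code.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical simulation of how adjacent-key comparison drives reduce() calls.
class GroupingDemo {

    // Counts the groups formed when a new group starts at every key that the
    // grouping comparator considers different from the key just before it.
    static <K> int countReduceCalls(List<K> keys, Comparator<? super K> grouping) {
        int calls = 0;
        K previous = null;
        boolean first = true;
        for (K key : keys) {
            if (first || grouping.compare(previous, key) != 0) {
                calls++; // transition detected: a new reduce() call begins
            }
            previous = key;
            first = false;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String> sorted = Arrays.asList(
                "KeyOne", "KeyOne", "KeyOne", "KeyTwo", "KeyTwo", "KeyTwo");
        List<String> alternating = Arrays.asList(
                "KeyOne", "KeyTwo", "KeyOne", "KeyTwo", "KeyOne", "KeyTwo");
        Comparator<String> natural = Comparator.naturalOrder();
        System.out.println(countReduceCalls(sorted, natural));      // 2
        System.out.println(countReduceCalls(alternating, natural)); // 6
    }
}
```

The same six keys produce two groups when they arrive sorted and six when they alternate, because the comparison only ever looks at neighbouring keys and never at the stream as a whole.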

If you use two mappers, then the reducer knows it has two sorted segments (one from each map), and so still needs to sort them prior to reducing. But because the number of segments is below a threshold, the sort is done as an inline stream sort (again, the segments are assumed to be sorted). You still get the wrong output with two mappers (10 records output from the reduce phase).
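To see why a streaming merge cannot repair segments that are only assumed to be sorted, consider this small sketch. It is a plain two-way merge written for illustration, not Hadoop's actual merger, and the class name SegmentMergeDemo, its methods, and the three-key segments are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical two-way streaming merge that trusts each segment to be sorted.
class SegmentMergeDemo {

    // Repeatedly emits the smaller of the two segment heads.
    // This yields sorted output only if both inputs are sorted.
    static List<String> merge(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i).compareTo(b.get(j)) <= 0) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    // Returns true if every element is >= the element before it.
    static boolean isSorted(List<String> keys) {
        for (int k = 1; k < keys.size(); k++) {
            if (keys.get(k - 1).compareTo(keys.get(k)) > 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Each segment mimics combiner output whose keys were rewritten.
        List<String> seg1 = Arrays.asList("KeyOne", "KeyTwo", "KeyOne");
        List<String> seg2 = Arrays.asList("KeyOne", "KeyTwo", "KeyOne");
        List<String> merged = merge(seg1, seg2);
        System.out.println(merged);
        // [KeyOne, KeyOne, KeyTwo, KeyOne, KeyTwo, KeyOne]
        System.out.println(isSorted(merged)); // false
    }
}
```

The merge only compares the heads of the segments, so out-of-order keys inside a segment pass straight through, and the grouping comparator then sees extra transitions.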

So again, don't amend the key in the combiner; this is not what the combiner is intended for.

Try this in the combiner instead:

context.write(new Text("KeyOne"), new Text("some value"));
context.write(new Text("KeyTwo"), new Text("some other value"));

The only way I see such a thing happening is if the key0 from one combiner is not found to be equal to the key0 from another. I am not sure how it would behave in case of keys pointing to the exact same instance (which is what would happen if you make the keys static).
