Secondary sorting in Map-Reduce

离开以前 2021-01-07 04:33

I understand how the values of a particular key can be sorted before the key enters the reducer. I learned that it can be done by writing three methods, viz. keycomparator,

1 answer
  •  难免孤独
    2021-01-07 04:42

    This may be surprising, but each iteration over the values Iterable also updates the contents of the key object:

    protected void reduce(K key, Iterable<V> values, Context context) {
        for (V value : values) {
            // the contents of the key object are updated on each iteration of this loop
        }
    }
    

    I know this works for the new mapreduce API; I haven't traced it for the old mapred API.

    So, in answer to your question: all the keys will be available, and the key you see on entering reduce is the first key of the group in sort order.

    EDIT: Some additional information as to how and why this works:

    There are two comparators that the reducer uses to process the key/value pairs output by the map stage:

    • the key ordering comparator - This comparator is applied first and orders all the KV pairs. Conceptually, you are still dealing with the serialized bytes at this stage.
    • the key group comparator - This comparator determines when the previous and current keys 'differ', which marks the boundary between one group of KV pairs and another.
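    The interplay of the two comparators can be sketched in plain Java, without Hadoop. This is only an illustration: the CompositeKey class, the TwoComparatorsSketch class, and the sample data are hypothetical. In a real job the comparators would typically be RawComparator implementations registered through Job.setSortComparatorClass and Job.setGroupingComparatorClass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical composite key: a natural key plus the field the values are sorted by.
final class CompositeKey {
    final String natural;
    final int secondary;

    CompositeKey(String natural, int secondary) {
        this.natural = natural;
        this.secondary = secondary;
    }

    @Override
    public String toString() {
        return natural + ":" + secondary;
    }
}

final class TwoComparatorsSketch {
    // Ordering comparator: natural key first, then the secondary field.
    static final Comparator<CompositeKey> ORDERING =
            Comparator.comparing((CompositeKey k) -> k.natural)
                    .thenComparingInt(k -> k.secondary);

    // Grouping comparator: looks at the natural key only.
    static final Comparator<CompositeKey> GROUPING =
            Comparator.comparing((CompositeKey k) -> k.natural);

    // Sorts the keys, then starts a new group wherever GROUPING reports a difference.
    static List<List<CompositeKey>> sortAndGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(ORDERING);

        List<List<CompositeKey>> groups = new ArrayList<>();
        CompositeKey previous = null;
        for (CompositeKey current : sorted) {
            if (previous == null || GROUPING.compare(previous, current) != 0) {
                groups.add(new ArrayList<>()); // group boundary
            }
            groups.get(groups.size() - 1).add(current);
            previous = current;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
                new CompositeKey("b", 2), new CompositeKey("a", 3),
                new CompositeKey("a", 1), new CompositeKey("b", 1));
        System.out.println(sortAndGroup(keys)); // prints [[a:1, a:3], [b:1, b:2]]
    }
}
```

    Each inner list corresponds to one reduce call, and within it the keys appear in the order imposed by the ordering comparator.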

    Under the hood, the references to the key and value objects never change. Each call to Iterator.next() on the values Iterable advances the pointer in the underlying byte stream to the next KV pair. If the key grouper determines that the current key's bytes and the previous key's bytes compare as the same key, the hasNext() method of the values iterator returns true; otherwise it returns false. When true is returned, the bytes are deserialized into the existing Key and Value instances for consumption in your reduce method.
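    The object reuse described above can be imitated with a small plain-Java simulation. It is a sketch of the behavior, not Hadoop's actual implementation: MutableKey, GroupValues, and the sample rows are hypothetical, string rows stand in for the serialized byte stream, and a string comparison on the natural key stands in for the grouping comparator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class ReusedKeySketch {
    // Stand-in for a reusable key object: one instance, contents overwritten per record.
    static final class MutableKey {
        String natural;
        int secondary;
    }

    // Values of the first group; next() "deserializes" the next row into the shared key.
    static final class GroupValues implements Iterator<String> {
        private final String[][] records; // sorted rows of {natural, secondary, value}
        private final MutableKey key;
        private final String groupNatural;
        private int pos = 0;

        GroupValues(String[][] records, MutableKey key) {
            this.records = records;
            this.key = key;
            this.groupNatural = records.length > 0 ? records[0][0] : null;
        }

        @Override
        public boolean hasNext() {
            // Grouping check: the next row belongs to this group only if its natural key matches.
            return pos < records.length && records[pos][0].equals(groupNatural);
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            String[] row = records[pos++];
            key.natural = row[0];                     // same key object, new contents
            key.secondary = Integer.parseInt(row[1]);
            return row[2];
        }
    }

    // Mimics the loop in a reduce method and records the key's contents at each iteration.
    static List<String> reduceFirstGroup(String[][] sortedRecords) {
        MutableKey key = new MutableKey();
        GroupValues values = new GroupValues(sortedRecords, key);
        List<String> seen = new ArrayList<>();
        while (values.hasNext()) {
            String value = values.next();
            seen.add(key.natural + ":" + key.secondary + "=" + value);
        }
        return seen;
    }

    public static void main(String[] args) {
        String[][] sorted = {{"a", "1", "x"}, {"a", "3", "y"}, {"b", "1", "z"}};
        System.out.println(reduceFirstGroup(sorted)); // prints [a:1=x, a:3=y]
    }
}
```

    The loop never reassigns key, yet its secondary field changes from 1 to 3, and iteration stops at the first row whose natural key differs.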
