Computing median in map reduce

礼貌的吻别  2020-12-14 20:33

Can someone explain the computation of median/quantiles in MapReduce?

My understanding of DataFu's median is that the 'n' mappers sort the data and send the da

4 Answers
  •  陌清茗
     2020-12-14 21:00

    In many real-world scenarios, the cardinality of values in a dataset will be relatively small. In such cases, the problem can be efficiently solved with two MapReduce jobs:

    1. Calculate frequencies of values in your dataset (Word Count job, basically)
    2. Identity mapper + a reducer which calculates the median from the <value, frequency> pairs

    Job 1 will drastically reduce the amount of data and can be executed fully in parallel. The reducer of job 2 will only have to process n items (n = the cardinality of your value set) instead of all the values, as in the naive approach.
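    Job 1 can be sketched roughly as follows. This is a hypothetical reducer for Hadoop streaming (the mapper would simply emit each value with a count of 1); the function name `count_frequencies` is an illustrative choice, not part of the original answer. It relies on Hadoop streaming grouping identical keys together on the reducer's stdin:

```python
import sys
from itertools import groupby

# Hypothetical job-1 reducer for Hadoop streaming: each input line is a
# "value\t1" pair emitted by the mapper, with identical keys adjacent.
def count_frequencies(lines):
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby collapses each run of identical values into one group;
    # summing the counts yields the <value, frequency> pair for job 2
    return [(value, sum(int(c) for _, c in group))
            for value, group in groupby(pairs, key=lambda p: p[0])]

if __name__ == "__main__":
    for value, freq in count_frequencies(sys.stdin):
        print("%s\t%d" % (value, freq))
```

    The output pairs become the input of the job-2 reducer shown below.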

    Below is an example reducer for job 2. It is a Python script that could be used directly in Hadoop streaming. It assumes the values in your dataset are ints, but it can easily be adapted for doubles.

    import sys
    
    # Read the <value, frequency> pairs and sort them numerically:
    # Hadoop streaming sorts keys lexicographically, so "10" would
    # otherwise come before "2"
    pairs = []
    for line in sys.stdin:
        item, count = line.strip().split("\t", 1)
        pairs.append((int(item), int(count)))
    pairs.sort()
    
    # Map each value to the 1-based range of indexes it occupies in a
    # sorted list of all values
    item_to_index_range = []
    total_count = 0
    for item, count in pairs:
        new_total_count = total_count + count
        item_to_index_range.append((item, (total_count + 1, new_total_count)))
        total_count = new_total_count
    
    # Calculate the index(es) of the middle item(s)
    middle_items_indexes = [total_count // 2 + 1]
    if total_count % 2 == 0:
        middle_items_indexes.append(total_count // 2)
    
    # Retrieve the middle item(s)
    middle_items = []
    for i in middle_items_indexes:
        for item, (first, last) in item_to_index_range:
            if first <= i <= last:
                middle_items.append(item)
                break
    
    print(sum(middle_items) / len(middle_items))
    

    This answer builds on a suggestion from Chris White's answer, which proposes using a combiner as a means to calculate the frequencies of values. However, in MapReduce, combiners are not guaranteed to run. This has some side effects:

    • the reducer will first have to compute the final <value, frequency> pairs and then calculate the median;
    • in the worst case, combiners never run and the reducer still has to process all of the individual values.
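    The first side effect above can be sketched concretely. When combiners run only on some map outputs, the reducer may receive several partial counts for the same value, so it must merge them before the median step; the helper name `merge_partial_counts` is an illustrative choice, not from the original answer:

```python
from collections import defaultdict

# Sketch: merge possibly-duplicated partial <value, count> pairs into
# final <value, frequency> pairs, sorted numerically by value
def merge_partial_counts(pairs):
    totals = defaultdict(int)
    for value, count in pairs:
        totals[value] += count
    return sorted(totals.items())
```

    For example, two partial counts for the same value, `(5, 3)` and `(5, 2)`, collapse into the single final pair `(5, 5)` before the median is computed.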
