Computing median in map reduce

礼貌的吻别  2020-12-14 20:33

Can someone explain the computation of median/quantiles in MapReduce?

My understanding of DataFu's median is that the 'n' mappers sort the data and send the da

4 Answers
  •  陌清茗
     2020-12-14 21:00

    In many real-world scenarios, the cardinality of values in a dataset will be relatively small. In such cases, the problem can be efficiently solved with two MapReduce jobs:

    1. Calculate frequencies of values in your dataset (Word Count job, basically)
    2. Identity mapper + a reducer which calculates the median from the <value, frequency> pairs

    Job 1 will drastically reduce the amount of data and can be executed fully in parallel. The reducer of job 2 will only have to process n items (n = the cardinality of your value set) instead of all the values, as in the naive approach.
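    Job 1 can be sketched roughly as follows. This is a hypothetical reducer for Hadoop streaming (the mapper would simply emit each value with a count of 1); the function name `count_frequencies` is an illustrative choice, not part of the original answer. It relies on Hadoop streaming grouping identical keys together on the reducer's stdin:

```python
import sys
from itertools import groupby

# Hypothetical job-1 reducer for Hadoop streaming: each input line is a
# "value\t1" pair emitted by the mapper, with identical keys adjacent.
def count_frequencies(lines):
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby collapses each run of identical values into one group;
    # summing the counts yields the <value, frequency> pair for job 2
    return [(value, sum(int(c) for _, c in group))
            for value, group in groupby(pairs, key=lambda p: p[0])]

if __name__ == "__main__":
    for value, freq in count_frequencies(sys.stdin):
        print("%s\t%d" % (value, freq))
```

    The output pairs become the input of the job-2 reducer shown below.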

    Below is an example reducer for job 2. It is a Python script that could be used directly in Hadoop streaming. It assumes the values in your dataset are ints, but it can easily be adapted for doubles.

    import sys
    
    # Read the <value, frequency> pairs and sort them numerically:
    # Hadoop streaming sorts keys lexicographically, so "10" would
    # otherwise come before "2"
    pairs = []
    for line in sys.stdin:
        item, count = line.strip().split("\t", 1)
        pairs.append((int(item), int(count)))
    pairs.sort()
    
    # Map each value to the 1-based range of indexes it occupies in a
    # sorted list of all values
    item_to_index_range = []
    total_count = 0
    for item, count in pairs:
        new_total_count = total_count + count
        item_to_index_range.append((item, (total_count + 1, new_total_count)))
        total_count = new_total_count
    
    # Calculate the index(es) of the middle item(s)
    middle_items_indexes = [total_count // 2 + 1]
    if total_count % 2 == 0:
        middle_items_indexes.append(total_count // 2)
    
    # Retrieve the middle item(s)
    middle_items = []
    for i in middle_items_indexes:
        for item, (first, last) in item_to_index_range:
            if first <= i <= last:
                middle_items.append(item)
                break
    
    print(sum(middle_items) / len(middle_items))
    

    This answer builds on a suggestion from Chris White's answer, which proposes using a combiner as a means to calculate the frequencies of values. However, in MapReduce, combiners are not guaranteed to run. This has some side effects:

    • the reducer will first have to compute the final <value, frequency> pairs and then calculate the median;
    • in the worst case, combiners never run and the reducer still has to process all of the individual values.
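    The first side effect above can be sketched concretely. When combiners run only on some map outputs, the reducer may receive several partial counts for the same value, so it must merge them before the median step; the helper name `merge_partial_counts` is an illustrative choice, not from the original answer:

```python
from collections import defaultdict

# Sketch: merge possibly-duplicated partial <value, count> pairs into
# final <value, frequency> pairs, sorted numerically by value
def merge_partial_counts(pairs):
    totals = defaultdict(int)
    for value, count in pairs:
        totals[value] += count
    return sorted(totals.items())
```

    For example, two partial counts for the same value, `(5, 3)` and `(5, 2)`, collapse into the single final pair `(5, 5)` before the median is computed.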
