Simple explanation of MapReduce?

佛祖请我去吃肉 2020-12-04 04:57

Related to my CouchDB question.

Can anyone explain MapReduce in terms a numbnuts could understand?

8 answers
  • 2020-12-04 05:26

    If you are familiar with Python, the following is the simplest possible explanation of MapReduce:

    In [1]: from functools import reduce  # in Python 3, reduce lives in functools

    In [2]: data = [1, 2, 3, 4, 5, 6]

    In [3]: mapped_result = list(map(lambda x: x * 2, data))  # map is lazy in Python 3

    In [4]: mapped_result
    Out[4]: [2, 4, 6, 8, 10, 12]

    In [5]: final_result = reduce(lambda x, y: x + y, mapped_result)

    In [6]: final_result
    Out[6]: 42
    

    See how each element of the raw data was processed individually, in this case multiplied by 2 (the map part of MapReduce). Then, based on mapped_result, we reduced the whole list to a single value, 42 (the reduce part of MapReduce).

    An important conclusion from this example is that each chunk of processing is independent of every other chunk. For instance, if thread_1 maps [1, 2, 3] and thread_2 maps [4, 5, 6], the combined result of both threads is still [2, 4, 6, 8, 10, 12], but the processing time has been roughly halved. The same can be said for the reduce operation (summing is associative, so partial sums can be combined in any order), and that is the essence of how MapReduce works in parallel computing, as the sketch below illustrates.
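    As a minimal sketch of that idea, the snippet below splits the data into chunks and maps them on separate worker processes using Python's standard concurrent.futures module; the two-chunk split and the double_all helper are illustrative choices, not part of MapReduce itself:

    from concurrent.futures import ProcessPoolExecutor
    from functools import reduce

    def double_all(chunk):
        return [x * 2 for x in chunk]  # the "map" work for one chunk

    if __name__ == "__main__":
        data = [1, 2, 3, 4, 5, 6]
        chunks = [data[:3], data[3:]]  # one chunk per worker, like thread_1 and thread_2 above
        with ProcessPoolExecutor() as pool:
            # pool.map preserves chunk order, so the flattened result matches the serial run
            mapped = [y for part in pool.map(double_all, chunks) for y in part]
        print(reduce(lambda x, y: x + y, mapped))  # 42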

  • 2020-12-04 05:27

    Let's take the example from the Google MapReduce paper. The goal of MapReduce is to make efficient use of a large number of processing units working in parallel for certain kinds of algorithms. The example is the following: you want to count how many times each word occurs across a set of documents.

    Typical sequential implementation (as a runnable Python sketch):

    from collections import defaultdict

    def word_count(documents):
        counts = defaultdict(int)          # word -> counter
        for document in documents:
            for word in document.split():
                counts[word] += 1          # get the word's counter and increment it
        return counts
    

    MapReduce implementation:

    # Map phase (input: a document key and the document's contents)
    def map_phase(document_key, document):
        for word in document.split():
            yield (word, 1)                # emit an event: the word as key, the value 1

    # Reduce phase (input: a key (a word) and an iterator over the values emitted for it)
    def reduce_phase(word, values):
        return word, sum(values)           # sum up the emitted values in a counter
    

    Around that, you'll have a master program which partitions the set of documents into "splits" that are handled in parallel during the Map phase. Each Map worker writes the values it emits into a buffer specific to that worker. As soon as the master is notified that a buffer is ready to be handled, it delegates other workers to perform the Reduce phase on it.
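    To make that flow concrete, here is a minimal single-process sketch of such a driver, reusing the map_phase and reduce_phase functions above; the in-memory grouping step stands in for the shuffle a real framework performs across workers, and run_mapreduce is a made-up name:

    from collections import defaultdict

    def run_mapreduce(documents):
        # Map: feed every (key, document) split through map_phase
        emitted = []
        for key, document in documents.items():
            emitted.extend(map_phase(key, document))

        # Shuffle: group the emitted values by key, as the master/framework would
        groups = defaultdict(list)
        for word, value in emitted:
            groups[word].append(value)

        # Reduce: call reduce_phase once per key
        return dict(reduce_phase(word, values) for word, values in groups.items())

    docs = {"d1": "the quick brown fox", "d2": "the fox ate the dog"}
    print(run_mapreduce(docs))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'ate': 1, 'dog': 1}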

    Every worker's output (whether from a Map or a Reduce worker) is in fact a file stored on the distributed file system (GFS in Google's case), or in the distributed database in the case of CouchDB.
