What is Map/Reduce?

伪装坚强ぢ 2020-12-07 07:18

I hear a lot about map/reduce, especially in the context of Google's massively parallel compute system. What exactly is it?

7 Answers
  •  臣服心动
    2020-12-07 07:37

    After getting frustrated with blog posts that were either very long and waffly or very short and vague, I eventually discovered this very good, rigorous, concise paper.

    Then I went ahead and made it more concise by translating it into Scala, where I've provided the simplest case: the user just specifies the map and reduce parts of the application. Strictly speaking, Hadoop/Spark employ a more complex programming model that requires the user to explicitly specify 4 more functions, outlined here: http://en.wikipedia.org/wiki/MapReduce#Dataflow

    import scalaz.syntax.id._
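    // `|>` ("thrush") from scalaz.syntax.id is just reverse application: x |> f == f(x)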
    
    trait MapReduceModel {
      type MultiSet[T] = Iterable[T]
    
      // `map` must be a pure function
      def mapPhase[K1, K2, V1, V2](map: ((K1, V1)) => MultiSet[(K2, V2)])
                                  (data: MultiSet[(K1, V1)]): MultiSet[(K2, V2)] = 
        data.flatMap(map)
    
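      // group the mapped pairs by key, so each reducer sees all values for a single key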
      def shufflePhase[K2, V2](mappedData: MultiSet[(K2, V2)]): Map[K2, MultiSet[V2]] =
        mappedData.groupBy(_._1).mapValues(_.map(_._2))
    
      // `reduce` must be a monoid operation: associative, so partial reductions can be merged in any grouping
      def reducePhase[K2, V2, V3](reduce: ((K2, MultiSet[V2])) => MultiSet[(K2, V3)])
                                 (shuffledData: Map[K2, MultiSet[V2]]): MultiSet[V3] =
        shuffledData.flatMap(reduce).map(_._2)
    
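      // the full pipeline: map, then shuffle, then reduce, chained with `|>`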
      def mapReduce[K1, K2, V1, V2, V3](data: MultiSet[(K1, V1)])
                                       (map: ((K1, V1)) => MultiSet[(K2, V2)])
                                       (reduce: ((K2, MultiSet[V2])) => MultiSet[(K2, V3)]): MultiSet[V3] =
        mapPhase(map)(data) |> shufflePhase |> reducePhase(reduce)
    }
    
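    // --- Added sketch, not from the paper: a hypothetical word-count usage of MapReduceModel.
    // reducePhase keeps only the V3 values, so V3 is made the (word, count) pair
    // in order to keep the words in the output.
    object WordCount extends MapReduceModel {
      def wordCount(files: MultiSet[(String, String)]): MultiSet[(String, Int)] =
        mapReduce(files) { case (_, contents) =>        // K1 = filename, V1 = contents
          contents.split("\\s+").filter(_.nonEmpty).toList.map(word => (word, 1))
        } { case (word, ones) =>                        // K2 = word, V2 = 1
          List((word, (word, ones.sum)))                // V3 = (word, total)
        }
    }
    // WordCount.wordCount(List(("a.txt", "the quick fox"), ("b.txt", "the lazy dog")))
    // yields, in some order: ("the", 2), ("quick", 1), ("fox", 1), ("lazy", 1), ("dog", 1)
    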
    // Roughly how MapReduce works in Hadoop and Spark, except that on a real cluster `.par` would give
    // each element its own process/thread. The splitting here doesn't enforce any kind of balance, and is
    // largely unnecessary anyway, since the data would already be split on HDFS - i.e. the filename would
    // constitute K1. The shuffle phase is also parallelized and reuses the same partitioning as the map phase.
    abstract class ParMapReduce(mapParNum: Int, reduceParNum: Int) extends MapReduceModel {
      def split[T](splitNum: Int)(data: MultiSet[T]): Set[MultiSet[T]]
    
      override def mapPhase[K1, K2, V1, V2](map: ((K1, V1)) => MultiSet[(K2, V2)])
                                           (data: MultiSet[(K1, V1)]): MultiSet[(K2, V2)] = {
        val groupedByKey = data.groupBy(_._1).map(_._2)
        groupedByKey.flatMap(split(mapParNum / groupedByKey.size + 1))
          .par.flatMap(_.map(map)).flatten.toList
      }
    
      override def reducePhase[K2, V2, V3](reduce: ((K2, MultiSet[V2])) => MultiSet[(K2, V3)])
                                          (shuffledData: Map[K2, MultiSet[V2]]): MultiSet[V3] =
        shuffledData.map(g => split(reduceParNum / shuffledData.size + 1)(g._2).map((g._1, _)))
          .par.flatMap(_.map(reduce))
          .flatten.map(_._2).toList
    }
    
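    To actually instantiate ParMapReduce you must supply `split`. A naive chunking sketch (my own addition, assuming Scala 2.12 where `.par` is in the standard library; on 2.13+ it needs the scala-parallel-collections module):

    // Hypothetical: chop the data into splitNum roughly equal chunks.
    // No balancing by size or locality - real systems split along HDFS blocks.
    val parModel = new ParMapReduce(mapParNum = 4, reduceParNum = 4) {
      def split[T](splitNum: Int)(data: MultiSet[T]): Set[MultiSet[T]] = {
        val chunkSize = math.max(1, data.size / splitNum)
        data.grouped(chunkSize).map(_.toList).toSet[MultiSet[T]]
      }
    }

    Calling parModel.mapReduce(data)(map)(reduce) then runs each chunk's map and reduce calls as parallel-collection tasks.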
