reduceByKey: How does it work internally?

Happy的楠姐 2020-12-04 09:22

I am new to Spark and Scala. I was confused about the way the reduceByKey function works in Spark. Suppose we have the following code:

val lines = sc.textFile("data.txt") // hypothetical input file
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

4 Answers
  •  爱一瞬间的悲伤
    2020-12-04 09:58

    In your example of

    val counts = pairs.reduceByKey((a,b) => a+b)
    

    a and b are both Int accumulators for _2 of the tuples in pairs. reduceByKey will take two tuples with the same key s and use their _2 values as a and b, producing a new Tuple2[String, Int]. This operation is repeated until there is only one tuple for each key s.
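
    To see this pairwise merging concretely, here is a minimal non-Spark sketch of the same semantics on a plain Scala Seq (the sample data is made up for illustration):

    // Plain Scala sketch of reduceByKey semantics, no cluster involved.
    // Pairs are grouped by their key s, and the Int values (_2) are
    // folded pairwise with the same (a, b) => a + b function.
    val pairs = Seq(("spark", 1), ("scala", 1), ("spark", 1), ("spark", 1))
    val counts = pairs
      .groupBy(_._1)
      .map { case (s, group) => (s, group.map(_._2).reduce((a, b) => a + b)) }
    // counts: Map(spark -> 3, scala -> 1)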

    Unlike a non-Spark (or, really, non-parallel) reduce, where the first argument is always the accumulator and the second a new value, reduceByKey operates in a distributed fashion: each node reduces its own set of tuples into a collection of uniquely-keyed tuples, and then the tuples from multiple nodes are reduced until there is a final uniquely-keyed set of tuples. This means that, as the results from nodes are combined, a and b represent already-reduced accumulators.
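
    Here is a rough, non-Spark sketch of that two-phase behaviour on plain Scala collections; the two "partitions" are hypothetical stand-ins for nodes:

    // Hypothetical partitions, as if the data lived on two nodes.
    val partition1 = Seq(("spark", 1), ("spark", 1), ("scala", 1))
    val partition2 = Seq(("spark", 1), ("scala", 1))

    val merge: (Int, Int) => Int = (a, b) => a + b

    // Phase 1: each node reduces its own tuples to uniquely-keyed partials.
    // Here a and b are raw values from the original pairs.
    def localReduce(part: Seq[(String, Int)]): Map[String, Int] =
      part.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(merge)) }

    val partial1 = localReduce(partition1) // Map(spark -> 2, scala -> 1)
    val partial2 = localReduce(partition2) // Map(spark -> 1, scala -> 1)

    // Phase 2: partials from different nodes are merged with the same function.
    // Now both a and b are already-reduced accumulators.
    val counts = (partial1.toSeq ++ partial2.toSeq)
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).reduce(merge)) }
    // counts: Map(spark -> 3, scala -> 1)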
