Using reduceByKey in Apache Spark (Scala)

Asked by 囚心锁ツ on 2020-12-24 02:23

I have a list of tuples of type (user id, name, count).

For example,

val x = sc.parallelize(List(
    ("a", "b", 1),
    ("a", "b", 1),
    ("c", "b", 1),
    ("a", "d", 1)))

How can I sum the counts per (user id, name) key using reduceByKey?
3 Answers

  •  生来不讨喜 · 2020-12-24 03:13

    Following your code:

    val byKey = x.map { case (id, uri, count) => (id, uri) -> count }
    

    You could do:

    val reducedByKey = byKey.reduceByKey(_ + _)
    
    scala> reducedByKey.collect.foreach(println)
    ((a,d),1)
    ((a,b),2)
    ((c,b),1)
    

    PairRDDFunctions[K,V].reduceByKey takes an associative (and commutative) reduce function that is applied to the values of type V of an RDD[(K,V)]. In other words, you need a function f[V](e1: V, e2: V): V. In this particular case, with sum on Ints: (x: Int, y: Int) => x + y, or _ + _ in underscore notation.
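
    For illustration, the same reduction can be written with an explicit named function instead of the underscore shorthand (a minimal sketch; sumCounts is a hypothetical name, not part of the original code):

    // Must be associative (and commutative), since reduceByKey may apply it
    // in any grouping order across partitions
    def sumCounts(x: Int, y: Int): Int = x + y

    val reducedByKey = byKey.reduceByKey(sumCounts)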

    For the record: reduceByKey performs better than groupByKey because it attempts to apply the reduce function locally on each partition (a map-side combine) before the shuffle/reduce phase, so far less data crosses the network. groupByKey forces a shuffle of all elements before grouping.
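
    As a rough sketch of that difference (reusing the byKey RDD from above), both of the following produce the same result, but the groupByKey version ships every individual record across the network before summing:

    // reduceByKey: partial sums are computed per partition (map-side combine),
    // then merged after the shuffle
    val viaReduce = byKey.reduceByKey(_ + _)

    // groupByKey: all (key, value) pairs are shuffled first,
    // then summed on the reducer side
    val viaGroup = byKey.groupByKey().mapValues(_.sum)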
