Using reduceByKey in Apache Spark (Scala)

Asked by 囚心锁ツ on 2020-12-24 02:23 · 3 answers · 1663 views

I have a list of tuples of type (user id, name, count).

For example,

val x = sc.parallelize(List(
    ("a", "b", 1),
    ("a", "b", 1)))
3 Answers
  •  悲哀的现实
    2020-12-24 03:16

    Your original data structure is RDD[(String, String, Int)], but reduceByKey can only be used on an RDD[(K, V)], i.e. a pair RDD whose elements are two-element tuples of key and value.

    val kv = x.map(e => e._1 -> e._2 -> e._3) // kv is RDD[((String, String), Int)]
    val reduced = kv.reduceByKey(_ + _)       // reduced is RDD[((String, String), Int)]
    val kv2 = reduced.map(e => e._1._1 -> (e._1._2 -> e._2)) // kv2 is RDD[(String, (String, Int))]
    val grouped = kv2.groupByKey()            // grouped is RDD[(String, Iterable[(String, Int)])]
    grouped.foreach(println)
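The same reshaping can be checked without a Spark cluster by running the equivalent transformations on a plain Scala collection. This is only an illustrative sketch: the input list adds a hypothetical third tuple ("a", "c", 2) (not in the original question) so that the grouping step has more than one distinct key, and `groupBy` plus a local `sum` stand in for Spark's `reduceByKey`.

```scala
// Local sketch of the RDD pipeline above (no Spark needed).
// Input: (user id, name, count) tuples; ("a", "c", 2) is an assumed extra row.
val x = List(("a", "b", 1), ("a", "b", 1), ("a", "c", 2))

// Step 1: key by (user, name) and sum counts -- mirrors reduceByKey(_ + _).
val reduced = x
  .groupBy { case (user, name, _) => (user, name) }
  .toList
  .map { case (key, rows) => key -> rows.map(_._3).sum }
// reduced: List[((String, String), Int)]

// Step 2: re-key by user and collect (name, total) pairs -- mirrors groupByKey.
val grouped = reduced
  .map { case ((user, name), count) => user -> (name, count) }
  .groupBy(_._1)
  .map { case (user, pairs) => user -> pairs.map(_._2) }
// grouped: Map[String, List[(String, Int)]]
```

Here `reduced` collapses the two ("a", "b") rows into a total of 2, and `grouped` ends up with a single entry for user "a" holding both (name, total) pairs.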
    
