Using reduceByKey to group a list of values

渐次进展 2020-12-20 06:15

I want to group the values for each key into a list and was doing something like this:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))


        
2 Answers
  • 2020-12-20 06:48
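    sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
      // wrap each value in a single-element List so values can be concatenated
      .map(t => (t._1, List(t._2)))
      // merge the lists of each key with List concatenation
      .reduceByKey(_ ::: _)
      .collect()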
    
    Array[(String, List[String])] = Array((red,List(zero, two)), (yellow,List(one)))
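
To see what `reduceByKey(_ ::: _)` computes without a Spark context, the same two steps can be sketched on a plain Scala collection. This is only an illustration under stated assumptions: `ReduceByKeySketch` and `groupValues` are hypothetical names, the standard library's `groupMapReduce` (Scala 2.13+) stands in for Spark's `reduceByKey`, and nothing about partitioning or shuffling is modeled.

```scala
object ReduceByKeySketch {
  // Hypothetical helper, not part of Spark: wrap every value in a
  // single-element List, then merge the lists of each key with :::
  def groupValues(pairs: Seq[(String, String)]): Map[String, List[String]] =
    pairs.groupMapReduce(_._1)(t => List(t._2))(_ ::: _)

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("red", "zero"), ("yellow", "one"), ("red", "two"))
    println(groupValues(pairs))
  }
}
```

Every `:::` call builds a new list, so the number of intermediate objects grows with the number of values per key.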
    
  • 2020-12-20 07:12

    Use aggregateByKey:

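    import scala.collection.mutable.ListBuffer

    sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
      .aggregateByKey(ListBuffer.empty[String])(
        // seqOp: append one value to the buffer of its key within a partition
        (numList, num) => { numList += num; numList },
        // combOp: merge the buffers built for the same key on different partitions
        (numList1, numList2) => { numList1 ++= numList2; numList1 })
      .mapValues(_.toList)
      .collect()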
    
    Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))
    

    See this answer for details on aggregateByKey, and this link for the rationale behind using a mutable collection (ListBuffer).
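
The two functions passed to aggregateByKey play different roles: the first folds values into a buffer within a partition, the second merges buffers across partitions. A minimal plain-Scala sketch of that flow follows, assuming the data is split into two hand-made "partitions". `AggregateByKeySketch`, `aggregatePartition` and `aggregate` are hypothetical names used only for illustration, and this is not how Spark implements the operation.

```scala
import scala.collection.mutable.ListBuffer

object AggregateByKeySketch {
  // seqOp: fold one value into the buffer of its key
  val seqOp: (ListBuffer[String], String) => ListBuffer[String] =
    (buf, v) => { buf += v; buf }

  // combOp: merge two buffers that belong to the same key
  val combOp: (ListBuffer[String], ListBuffer[String]) => ListBuffer[String] =
    (a, b) => { a ++= b; a }

  // Hypothetical stand-in for the work done inside a single partition
  def aggregatePartition(part: Seq[(String, String)]): Map[String, ListBuffer[String]] =
    part.foldLeft(Map.empty[String, ListBuffer[String]]) { case (acc, (k, v)) =>
      // start from a fresh empty buffer the first time a key is seen
      acc.updated(k, seqOp(acc.getOrElse(k, ListBuffer.empty[String]), v))
    }

  // Hypothetical stand-in for the whole operation over all partitions
  def aggregate(partitions: Seq[Seq[(String, String)]]): Map[String, List[String]] = {
    val merged = partitions
      .map(aggregatePartition)
      .foldLeft(Map.empty[String, ListBuffer[String]]) { (acc, partial) =>
        partial.foldLeft(acc) { case (m, (k, buf)) =>
          // combine with the existing buffer for this key, if there is one
          m.updated(k, m.get(k).map(combOp(_, buf)).getOrElse(buf))
        }
      }
    // freeze the mutable buffers, like mapValues(_.toList)
    merged.map { case (k, buf) => k -> buf.toList }
  }
}
```

For example, `AggregateByKeySketch.aggregate(Seq(Seq(("red", "zero"), ("yellow", "one")), Seq(("red", "two"))))` appends "zero" and "one" in the first partition, "two" in the second, and then merges the two "red" buffers.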

    EDIT:

    Is there a way to achieve the same result using reduceByKey?

    The above is actually worse in performance; see the comments by @zero323 for details.
