I want to group list of values per key and was doing something like this:
sc.parallelize(Array((\"red\", \"zero\"), (\"yellow\", \"one\"), (\"red\", \"two\")
sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
.map(t => (t._1,List(t._2)))
.reduceByKey(_:::_)
.collect()
Array[(String, List[String])] = Array((red,List(zero, two)), (yellow,List(one)))
Use aggregateByKey:
sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
.aggregateByKey(ListBuffer.empty[String])(
(numList, num) => {numList += num; numList},
(numList1, numList2) => {numList1.appendAll(numList2); numList1})
.mapValues(_.toList)
.collect()
Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))
See this answer for the details on aggregateByKey, and this link for the rationale behind using a mutable collection (ListBuffer) as the aggregation buffer.
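For contrast, here is a minimal sketch of the same aggregation written with an immutable List instead of a ListBuffer (my illustration, not taken from the linked answer). It yields the same grouping, but :+ and ::: each copy the existing list, which is the kind of overhead the mutable buffer avoids:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .aggregateByKey(List.empty[String])(
    (acc, v) => acc :+ v,           // :+ rebuilds the whole accumulator to append one element
    (acc1, acc2) => acc1 ::: acc2)  // ::: copies the left-hand list on every merge
  .collect()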
EDIT:
Is there a way to achieve the same result using reduceByKey?
The above approach is actually worse in performance; please see the comments by @zero323 for the details.
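For reference, when the goal is simply to collect every value per key, groupByKey is the most direct formulation. A minimal sketch (my own illustration of that alternative, not quoted from those comments) that produces the same (String, List[String]) pairs:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .groupByKey()           // RDD[(String, Iterable[String])]
  .mapValues(_.toList)    // convert each Iterable to a List for the same output shape
  .collect()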