Say I have a PairRDD like this (obviously with far more data in real life; assume millions of records):
val scores = sc.parallelize(Array(
("a", 1),
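I think this should be quite efficient, since reduceByKey merges values within each partition before the shuffle and only ever carries a pair of values per key: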
Edited according to OP comments:
scores.mapValues(p => (p, p)).reduceByKey((u, v) => {
val values = List(u._1, u._2, v._1, v._2).sorted(Ordering[Int].reverse).distinct
if (values.size > 1) (values(0), values(1))
else (values(0), values(0))
}).collect().foreach(println)
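To check the merge logic without a cluster, here is a minimal sketch that applies the same function to a local collection in plain Scala. The names `Top2Sketch`, `mergeTop2`, `top2ByKey` and the sample data are made up for illustration; `groupBy` followed by `reduce` only stands in for what `reduceByKey` does per key and is not how Spark executes it.

```scala
object Top2Sketch {
  // Merge two (largest, secondLargest) pairs into one pair.
  // Same logic as the function passed to reduceByKey above.
  def mergeTop2(u: (Int, Int), v: (Int, Int)): (Int, Int) = {
    val values = List(u._1, u._2, v._1, v._2).sorted(Ordering[Int].reverse).distinct
    if (values.size > 1) (values(0), values(1))
    else (values(0), values(0))
  }

  // Local stand-in for mapValues(p => (p, p)).reduceByKey(mergeTop2).
  def top2ByKey(data: Seq[(String, Int)]): Map[String, (Int, Int)] =
    data
      .groupBy(_._1)
      .map { case (key, kvs) =>
        key -> kvs.map { case (_, p) => (p, p) }.reduce((u, v) => mergeTop2(u, v))
      }

  // Hypothetical sample data, not taken from the question.
  val sample: Seq[(String, Int)] =
    Seq(("a", 1), ("a", 2), ("a", 3), ("b", 3), ("b", 1), ("c", 4))

  def main(args: Array[String]): Unit =
    top2ByKey(sample).foreach(println)
}
```

Note that a key with only one distinct value ends up with that value in both positions, which matches the `else` branch of the original function.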