I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).
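The requested transformation can be sketched in plain Python as a minimal simulation of what Spark's `groupByKey` does on one machine (the names `pairs` and `grouped` are illustrative, not from the original post):

```python
from collections import defaultdict

# Group (K, V) pairs into (K, [V1, V2, ...]) — the same grouping that
# Spark's groupByKey performs, shown here without a cluster.
pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

result = dict(grouped)
# result == {"a": [1, 3, 4], "b": [2, 5]}
```

In PySpark the equivalent is `rdd.groupByKey().mapValues(list)`, since `groupByKey` yields an iterable, not a list, per key.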
tl;dr If you really require an operation like this, use `groupByKey` as suggested by @MariusIon. Every other solution proposed here is either bluntly inefficient or at least suboptimal compared to direct grouping.

`reduceByKey` with list concatenation is not an acceptable solution because:

- Each application of `+` to a pair of lists requires a full copy of both lists (O(N)), effectively increasing the overall complexity to O(N²).
- It doesn't solve any of the problems of `groupByKey`: the amount of data that has to be shuffled, as well as the size of the final structure, is the same.
- There is no difference in the level of parallelism between implementations using `reduceByKey` and `groupByKey`.
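The quadratic cost of list concatenation can be demonstrated without Spark by folding n singleton lists with `+`, the way `reduceByKey(lambda a, b: a + b)` would combine the values for a single key (the counting wrapper `concat` is illustrative):

```python
from functools import reduce

# Each `+` on a pair of lists allocates a new list and copies every
# element of both operands. Count the copied elements while folding
# n singleton lists sequentially.
copies = 0

def concat(a, b):
    global copies
    copies += len(a) + len(b)  # elements copied into the freshly allocated list
    return a + b

n = 1000
singletons = [[i] for i in range(n)]
merged = reduce(concat, singletons)

# A left fold copies 2 + 3 + ... + n elements in total:
# n*(n+1)//2 - 1, i.e. O(n^2) work to build a list of n elements.
```

By contrast, n calls to `list.append` do O(n) total work, which is why repeated concatenation is the wrong tool here.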
`combineByKey` with `list.extend` is a suboptimal solution because:

- It creates O(N) `list` objects in `mergeValue` (this could be optimized by using `list.append` directly on the new item).
- If used with `list.append`, it is exactly equivalent to the old (Spark <= 1.3) implementation of `groupByKey` and ignores all the optimizations introduced by SPARK-3074, which enables external (on-disk) grouping of larger-than-memory structures.
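`combineByKey`'s three-function contract can be simulated in plain Python to show why `append`-based merging is exactly a hand-rolled `groupByKey`. The function names mirror the real PySpark parameters (`createCombiner`, `mergeValue`, `mergeCombiners`); the two-phase driver loop and the `partitions` data are illustrative assumptions:

```python
def create_combiner(v):
    return [v]              # first value for a key starts a fresh list

def merge_value(acc, v):
    acc.append(v)           # list.append: O(1), no new list object
    return acc

def merge_combiners(a, b):
    a.extend(b)             # merge per-partition lists after the shuffle
    return a

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4), ("a", 5)]]

# Phase 1: combine values within each partition (map side).
per_partition = []
for part in partitions:
    acc = {}
    for k, v in part:
        acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
    per_partition.append(acc)

# Phase 2: merge partition-level combiners (reduce side).
combined = {}
for acc in per_partition:
    for k, c in acc.items():
        combined[k] = merge_combiners(combined[k], c) if k in combined else c

# combined == {"a": [1, 3, 5], "b": [2, 4]}
```

Note that the result materializes every per-key list in memory, which is precisely what the post-SPARK-3074 `groupByKey` avoids by spilling to disk.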