I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).
tl;dr If you really require an operation like this, use `groupByKey` as suggested by @MariusIon. Every other solution proposed here is either bluntly inefficient or at least suboptimal compared to direct grouping.
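A minimal PySpark sketch of the direct approach; the `sc` handle, the `pairs` RDD, and its contents are illustrative, not taken from the question:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Illustrative (K, V) pairs; any RDD of 2-tuples works the same way.
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3), ("a", 4)])

# groupByKey yields (K, ResultIterable); materialize with list() if needed.
grouped = pairs.groupByKey().mapValues(list)

print(grouped.collect())  # e.g. [('a', [1, 2, 4]), ('b', [3])]
```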
`reduceByKey` with list concatenation is not an acceptable solution because:
- It requires initialization of O(N) lists.
- Each application of `+` to a pair of lists requires a full copy of both lists (O(N)), effectively increasing the overall complexity to O(N²) (see the sketch after this list).
- It doesn't address any of the problems introduced by `groupByKey`. The amount of data that has to be shuffled, as well as the size of the final structure, are the same.
- Contrary to what one of the answers suggests, there is no difference in the level of parallelism between implementations using `reduceByKey` and `groupByKey`.
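For contrast, a sketch of the concatenation anti-pattern criticized above, reusing the hypothetical `pairs` RDD from the earlier snippet:

```python
# Anti-pattern: wrap each value in a list, then concatenate with `+`.
# Every merge copies both operand lists, so a key with N values
# performs O(N²) work in total.
bad = pairs.mapValues(lambda v: [v]).reduceByKey(lambda a, b: a + b)
```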
`combineByKey` with `list.extend` is a suboptimal solution because:

- It creates O(N) list objects in `mergeValue` (this could be optimized by using `list.append` directly on the new item; see the sketch after this list).
- If used with `list.append`, it is exactly equivalent to the old (Spark <= 1.3) implementation of `groupByKey` and ignores all the optimizations introduced by SPARK-3074, which enables external (on-disk) grouping of larger-than-memory structures.
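A sketch of the `combineByKey` variant in the same hypothetical setting; the function names are mine and map to `combineByKey`'s three parameters:

```python
def create_combiner(v):
    # Start a new per-key list from the first value.
    return [v]

def merge_value(acc, v):
    # The suboptimal form discussed above would be acc.extend([v]),
    # which allocates a throwaway one-element list per record;
    # appending the item directly avoids that.
    acc.append(v)
    return acc

def merge_combiners(a, b):
    # Merge two per-partition lists for the same key.
    a.extend(b)
    return a

combined = pairs.combineByKey(create_combiner, merge_value, merge_combiners)
```

Even in the `append` form this just reimplements the old `groupByKey` in user code and forgoes the external (on-disk) grouping from SPARK-3074, which is why direct grouping remains the recommendation.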