I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). How can I do this?
You can use the RDD groupByKey method. Note that in PySpark, groupByKey returns an iterable per key (a ResultIterable), so apply mapValues(list) afterwards if you want plain Python lists.
Input:
data = [(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
rdd = sc.parallelize(data)
result = rdd.groupByKey().mapValues(list).collect()
Output:
[(1, ['a', 'b']), (2, ['c', 'd', 'e']), (3, ['f'])]
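If you just want to see the grouping semantics without a Spark cluster, the same combine step can be sketched in plain Python with a dict. This is only an illustration of what groupByKey().mapValues(list) produces per key, not of how Spark distributes the work:

```python
from collections import defaultdict

def group_by_key(pairs):
    # Collect every value under its key, preserving input order per key,
    # mirroring the shape of groupByKey().mapValues(list).collect().
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Sort by key for a deterministic result, like the output shown above.
    return sorted(grouped.items())

data = [(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
print(group_by_key(data))
# [(1, ['a', 'b']), (2, ['c', 'd', 'e']), (3, ['f'])]
```

Keep in mind that with a real distributed RDD the ordering of values within a key is not guaranteed across partitions, and the collected key order is arbitrary unless you sort explicitly.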