I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).
I'm kind of late to the conversation, but here's my suggestion:
>>> foo = sc.parallelize([(1, ('a', 'b')), (2, ('c', 'd')), (1, ('x', 'y'))])
>>> foo.map(lambda kv: (kv[0], [kv[1]])).reduceByKey(lambda p, q: p + q).collect()
[(1, [('a', 'b'), ('x', 'y')]), (2, [('c', 'd')])]

Note that I index into the pair with kv[0] and kv[1] rather than using lambda (x, y): ..., since tuple-unpacking lambdas were removed in Python 3. The idea is the same: wrap each value in a one-element list, then concatenate the lists per key with reduceByKey.
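If you'd rather paste something into a script than the shell, here is a minimal self-contained sketch of the same idea. It assumes a local PySpark install; the app name "collect-values" is just a placeholder. It also shows groupByKey as an alternative that groups the values for you without concatenating Python lists in the reduce step.

from pyspark import SparkContext

sc = SparkContext("local[*]", "collect-values")

foo = sc.parallelize([(1, ('a', 'b')), (2, ('c', 'd')), (1, ('x', 'y'))])

# Wrap each value in a one-element list, then concatenate the lists per key.
grouped = foo.mapValues(lambda v: [v]).reduceByKey(lambda p, q: p + q)
print(grouped.collect())
# e.g. [(1, [('a', 'b'), ('x', 'y')]), (2, [('c', 'd')])]

# Alternative: groupByKey returns an iterable of values per key,
# which you can turn into a list with mapValues(list).
print(foo.groupByKey().mapValues(list).collect())

sc.stop()

Either way you end up with one (K, [V1, V2, ..., Vn]) pair per key; which one performs better depends on how much the values can be combined before the shuffle.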