I am writing a Spark application and want to combine a set of key-value pairs (K, V1), (K, V2), ..., (K, Vn) into one key-multivalue pair (K, [V1, V2, ..., Vn]).
I tried this with combineByKey; here are my steps:
combineddatardd = sc.parallelize([("A", 3), ("A", 9), ("A", 12), ("B", 4), ("B", 10), ("B", 11)])
combineddatardd.combineByKey(lambda v: [v], lambda x, y: x + [y], lambda x, y: x + y).collect()
Output:
[('A', [3, 9, 12]), ('B', [4, 10, 11])]
Define a combiner function that initializes the accumulator with the first key-value pair it encounters inside a partition; this step wraps the value in a list.
Define a function that merges each subsequent value of the same key into the accumulator created in step 1. Note: append the new value to the list here, since the accumulator was converted to a list in the first step.
Define a function that merges the combiner outputs of the individual partitions.
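The three steps above can be sketched in plain Python as a simulation of what combineByKey does per partition (this is not Spark itself, and the function and variable names are my own; Spark applies the same three callbacks internally):

```python
def create_combiner(v):
    # Step 1: called for the first value of a key in a partition;
    # start the accumulator as a list.
    return [v]

def merge_value(acc, v):
    # Step 2: called for later values of the same key in the same
    # partition; append to the list accumulator.
    return acc + [v]

def merge_combiners(acc1, acc2):
    # Step 3: called to merge the per-partition accumulators of a key.
    return acc1 + acc2

def combine_by_key(partitions):
    # partitions: a list of partitions, each a list of (key, value) pairs.
    per_partition = []
    for part in partitions:
        accs = {}
        for k, v in part:
            if k not in accs:
                accs[k] = create_combiner(v)   # first value for this key
            else:
                accs[k] = merge_value(accs[k], v)
        per_partition.append(accs)
    result = {}
    for accs in per_partition:
        for k, acc in accs.items():
            if k not in result:
                result[k] = acc
            else:
                result[k] = merge_combiners(result[k], acc)
    return result

# Two partitions, mirroring the parallelized RDD above:
data = [[("A", 3), ("A", 9), ("A", 12)], [("B", 4), ("B", 10), ("B", 11)]]
print(combine_by_key(data))  # {'A': [3, 9, 12], 'B': [4, 10, 11]}
```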