I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn)
into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).
If you want to do a reduceByKey where the type in the reduced KV pairs is different from the type in the original KV pairs, then you can use the function combineByKey. What the function does is take KV pairs and combine them (by key) into KC pairs, where C is a different type than V.
One specifies three functions: createCombiner, mergeValue, and mergeCombiners. The first specifies how to transform a type V into a type C, the second describes how to combine a type C with a type V, and the last specifies how to combine a type C with another type C. My code creates the K-V pairs (a stand-in sketch is shown below).
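The pair-creation code itself is not included here; as a minimal stand-in (hypothetical keys and tuple values, built with pyspark's parallelize), it might look like this:

from pyspark import SparkContext

sc = SparkContext("local", "combineByKey-example")

# Hypothetical K-V pairs: keys repeat and each value is a tuple.
My_KV = sc.parallelize([("a", (1, 2)), ("a", (3, 4)), ("b", (5, 6))])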
Define the 3 functions as follows:
def Combiner(a):  # Turns value a (a tuple) into a list of a single tuple.
    return [a]

def MergeValue(a, b):  # a is the new type [(,), (,), ..., (,)] and b is the old type (,)
    a.extend([b])
    return a

def MergeCombiners(a, b):  # a is the new type [(,), ..., (,)] and so is b; combine them.
    a.extend(b)
    return a
Then, My_KMV = My_KV.combineByKey(Combiner, MergeValue, MergeCombiners)
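Using the hypothetical My_KV sketched above, collecting the result would give something like this (the exact keys and values are assumptions, and ordering may vary across partitions):

My_KMV = My_KV.combineByKey(Combiner, MergeValue, MergeCombiners)

print(My_KMV.collect())
# Roughly: [('a', [(1, 2), (3, 4)]), ('b', [(5, 6)])]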
The best resource I found on using this function is: http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/
As others have pointed out, both a.append(b) and a.extend(b) return None. So reduceByKey(lambda a, b: a.append(b)) returns None on the first pair of KV pairs, then fails on the second pair because None.append(b) fails. You could work around this by defining a separate function:
def My_Extend(a, b):  # Concatenate list b onto list a and return the result.
    a.extend(b)
    return a
Then call reduceByKey(lambda a, b: My_Extend(a, b)). (The use of the lambda function here may be unnecessary, but I have not tested this case.)
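Note that for My_Extend's a.extend(b) to work inside reduceByKey, both arguments must already be lists, so each value needs to be wrapped in a one-element list first (for example with mapValues). A minimal sketch, again using the hypothetical My_KV from above:

# Wrap each tuple value in a one-element list, then concatenate the lists by key.
My_KMV2 = My_KV.mapValues(lambda v: [v]).reduceByKey(My_Extend)

print(My_KMV2.collect())
# Roughly: [('a', [(1, 2), (3, 4)]), ('b', [(5, 6)])]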