Reduce a key-value pair into a key-list pair with Apache Spark

Asked by 生来不讨喜, 2020-11-27 14:21

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).

9 Answers
  •  执笔经年
    2020-11-27 15:08

    If you want a reduceByKey-style operation where the reduced KV pairs have a different type than the original KV pairs, use combineByKey. It takes KV pairs and combines them (by key) into KC pairs, where C is a type different from V.

    You specify 3 functions: createCombiner, mergeValue, and mergeCombiners. The first specifies how to turn a type V into a type C, the second describes how to combine a type C with a type V, and the last specifies how to combine a type C with another type C.

    Define the 3 functions as follows:

    def Combiner(a):           # createCombiner: wrap a single value tuple in a list
        return [a]

    def MergeValue(a, b):      # mergeValue: a is the accumulated list of tuples, b is a new tuple
        a.append(b)
        return a

    def MergeCombiners(a, b):  # mergeCombiners: a and b are both accumulated lists; concatenate them
        a.extend(b)
        return a
    

    Then, My_KMV = My_KV.combineByKey(Combiner, MergeValue, MergeCombiners)

    The best resource I found on using this function is: http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/
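    To make the three roles concrete, combineByKey's merge logic can be modeled in plain Python without Spark. This is a single-partition sketch, not the real Spark API; combine_by_key and the sample pairs below are illustrative names:

```python
def combine_by_key(pairs, create_combiner, merge_value, merge_combiners):
    # Single-partition model: merge_combiners is only needed when Spark
    # merges results from different partitions, so it is unused here.
    result = {}
    for k, v in pairs:
        if k in result:
            result[k] = merge_value(result[k], v)   # (C, V) -> C
        else:
            result[k] = create_combiner(v)          # V -> C
    return result

pairs = [("K", (1, 2)), ("K", (3, 4)), ("J", (5, 6))]
combined = combine_by_key(
    pairs,
    lambda v: [v],           # createCombiner: tuple -> [tuple]
    lambda c, v: c + [v],    # mergeValue: append a tuple to the list
    lambda c1, c2: c1 + c2,  # mergeCombiners: concatenate two lists
)
print(combined)  # {'K': [(1, 2), (3, 4)], 'J': [(5, 6)]}
```

    In real Spark, the same three lambdas passed to RDD.combineByKey would produce the same per-key lists, with mergeCombiners kicking in across partitions.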

    As others have pointed out, a.append(b) and a.extend(b) return None. So reduceByKey(lambda a, b: a.append(b)) produces None when merging the first pair of values, and then fails on the next merge because None.append(b) raises an AttributeError. You can work around this by defining a separate function:

     def My_Extend(a, b):
          a.extend(b)
          return a
    

    Then call reduceByKey(My_Extend). (Wrapping it in a lambda, reduceByKey(lambda a, b: My_Extend(a, b)), works too but is unnecessary, since My_Extend already has the right signature.)
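    Spark isn't needed to see why the lambda version fails: Python's functools.reduce applies the merge function the same pairwise way, so it reproduces the bug and the fix (a minimal sketch, not the Spark API):

```python
from functools import reduce

values = [[1], [2], [3]]

# Broken: list.extend returns None, so the second reduction step
# receives None as its accumulator and raises AttributeError.
try:
    reduce(lambda a, b: a.extend(b), values)
except AttributeError as e:
    print("failed:", e)  # 'NoneType' object has no attribute 'extend'

# Fixed: return the accumulator explicitly.
def my_extend(a, b):
    a.extend(b)
    return a

print(reduce(my_extend, values))  # [1, 2, 3]
```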
