Reduce a key-value pair into a key-list pair with Apache Spark

生来不讨喜 2020-11-27 14:21

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).

9 Answers
  • 2020-11-27 15:08

    If you want to do a reduceByKey where the type in the reduced KV pairs differs from the type in the original KV pairs, you can use the function combineByKey. It takes KV pairs and combines them (by key) into KC pairs, where C is a different type than V.

    You specify 3 functions: createCombiner, mergeValue, and mergeCombiners. The first specifies how to transform a type V into a type C, the second describes how to combine a type C with a type V, and the last specifies how to combine a type C with another type C. In my code, the values V are tuples.

    Define the 3 functions as follows:

    def Combiner(a):    # turn a value a (a tuple) into a list containing that single tuple
        return [a]

    def MergeValue(a, b):    # a is a list of tuples [(,), ..., (,)]; b is a single tuple (,)
        a.append(b)
        return a

    def MergeCombiners(a, b):    # a and b are both lists of tuples; concatenate them
        a.extend(b)
        return a
    

    Then, My_KMV = My_KV.combineByKey(Combiner, MergeValue, MergeCombiners)
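
    As a concrete illustration, here is a minimal runnable sketch using the three functions above. The SparkContext setup and the sample data are my own additions for illustration, not from the original answer:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Sample K-V pairs whose values are tuples, matching the comments above.
    My_KV = sc.parallelize([("k1", (1, 2)), ("k1", (3, 4)), ("k2", (5, 6))])

    My_KMV = My_KV.combineByKey(Combiner, MergeValue, MergeCombiners)
    print(My_KMV.collect())
    # [('k1', [(1, 2), (3, 4)]), ('k2', [(5, 6)])]  (ordering may vary)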

    The best resource I found on using this function is: http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/

    As others have pointed out, a.append(b) and a.extend(b) return None. So reduceByKey(lambda a, b: a.append(b)) returns None for the first pair of values, then fails on the next value because None.append(b) raises an error. You can work around this by defining a separate function that returns the list:

     def My_Extend(a,b):
          a.extend(b)
          return a
    

    Then call reduceByKey(My_Extend) (or, equivalently, reduceByKey(lambda a, b: My_Extend(a, b)); the lambda wrapper is unnecessary, since My_Extend already takes two arguments). Note that this only works if every value is already a list, so wrap each value first, as in the sketch below.
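
    A minimal sketch of the full pattern; the sample data and the names pairs/result are illustrative, and it assumes My_Extend from above and an existing SparkContext sc:

    # Wrap each value in a singleton list first, so that My_Extend always
    # receives two lists to merge.
    pairs = sc.parallelize([("K", "V1"), ("K", "V2"), ("K", "V3")])
    result = pairs.mapValues(lambda v: [v]).reduceByKey(My_Extend).collect()
    # [('K', ['V1', 'V2', 'V3'])]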

  • 2020-11-27 15:13

    I tried this with combineByKey; here are my steps:

    combineddatardd = sc.parallelize([("A", 3), ("A", 9), ("A", 12), ("B", 4), ("B", 10), ("B", 11)])

    combineddatardd.combineByKey(lambda v: [v], lambda x, y: x + [y], lambda x, y: x + y).collect()
    

    Output:

    [('A', [3, 9, 12]), ('B', [4, 10, 11])]
    
    1. Define a combiner function that sets the accumulator to the first key-value pair encountered within a partition, converting the value to a list in this step.

    2. Define a function that merges a new value of the same key into the accumulator from step 1. Note: wrap the new value in a list before concatenating, since the accumulator was converted to a list in the first step.

    3. Define a function to merge the combiner outputs of the individual partitions (see the sketch after this list).
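
    For clarity, here are the same three arguments written as named functions instead of lambdas, to make the mapping to the steps explicit. This is a sketch with hypothetical function names; it assumes combineddatardd from above:

    def to_list(v):    # step 1: first value seen for a key in a partition -> singleton list
        return [v]

    def merge_value(acc, v):    # step 2: merge a new value of the same key into the partition's list
        acc.append(v)
        return acc

    def merge_lists(acc1, acc2):    # step 3: merge the lists built by different partitions
        acc1.extend(acc2)
        return acc1

    combineddatardd.combineByKey(to_list, merge_value, merge_lists).collect()
    # [('A', [3, 9, 12]), ('B', [4, 10, 11])]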

  • 2020-11-27 15:20

    You can use the RDD groupByKey method. Note that groupByKey returns an iterable per key, not a list, so map the grouped values through list to get the output shown below:

    Input:

    data = [(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
    rdd = sc.parallelize(data)
    result = rdd.groupByKey().mapValues(list).collect()
    

    Output:

    [(1, ['a', 'b']), (2, ['c', 'd', 'e']), (3, ['f'])]
    
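    For reference, without the mapValues(list) step each collected value is a pyspark ResultIterable rather than a list:

    # Each value below is an iterable, not a list:
    rdd.groupByKey().collect()
    # [(1, <pyspark.resultiterable.ResultIterable object at ...>), ...]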