Reduce a key-value pair into a key-list pair with Apache Spark

生来不讨喜 2020-11-27 14:21

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).

9 Answers
  • 2020-11-27 15:08

    If you want to do a reduceByKey where the type in the reduced KV pairs differs from the type in the original KV pairs, you can use the function combineByKey. It takes KV pairs and combines them (by key) into KC pairs, where C is a different type than V.

    You specify 3 functions: createCombiner, mergeValue, and mergeCombiners. The first specifies how to transform a type V into a type C, the second describes how to combine a type C with a type V, and the last specifies how to combine a type C with another type C. In my code, the values V are tuples.

    Define the 3 functions as follows:

    def Combiner(a):    # turn a value a (a tuple) into a list containing that single tuple
        return [a]

    def MergeValue(a, b):    # a is a list of tuples [(,), ..., (,)]; b is a single tuple (,)
        a.append(b)
        return a

    def MergeCombiners(a, b):    # a and b are both lists of tuples; concatenate them
        a.extend(b)
        return a
    

    Then, My_KMV = My_KV.combineByKey(Combiner, MergeValue, MergeCombiners)
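
    As a concrete illustration, here is a minimal runnable sketch using the three functions above. The SparkContext setup and the sample data are my own additions for illustration, not from the original answer:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Sample K-V pairs whose values are tuples, matching the comments above.
    My_KV = sc.parallelize([("k1", (1, 2)), ("k1", (3, 4)), ("k2", (5, 6))])

    My_KMV = My_KV.combineByKey(Combiner, MergeValue, MergeCombiners)
    print(My_KMV.collect())
    # [('k1', [(1, 2), (3, 4)]), ('k2', [(5, 6)])]  (ordering may vary)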

    The best resource I found on using this function is: http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/

    As others have pointed out, a.append(b) and a.extend(b) return None. So reduceByKey(lambda a, b: a.append(b)) returns None for the first pair of values, then fails on the next value because None.append(b) raises an error. You can work around this by defining a separate function that returns the list:

     def My_Extend(a,b):
          a.extend(b)
          return a
    

    Then call reduceByKey(My_Extend) (or, equivalently, reduceByKey(lambda a, b: My_Extend(a, b)); the lambda wrapper is unnecessary, since My_Extend already takes two arguments). Note that this only works if every value is already a list, so wrap each value first, as in the sketch below.
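
    A minimal sketch of the full pattern; the sample data and the names pairs/result are illustrative, and it assumes My_Extend from above and an existing SparkContext sc:

    # Wrap each value in a singleton list first, so that My_Extend always
    # receives two lists to merge.
    pairs = sc.parallelize([("K", "V1"), ("K", "V2"), ("K", "V3")])
    result = pairs.mapValues(lambda v: [v]).reduceByKey(My_Extend).collect()
    # [('K', ['V1', 'V2', 'V3'])]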

  • 2020-11-27 15:13

    I tried this with combineByKey; here are my steps:

    combineddatardd = sc.parallelize([("A", 3), ("A", 9), ("A", 12), ("B", 4), ("B", 10), ("B", 11)])

    combineddatardd.combineByKey(lambda v: [v], lambda x, y: x + [y], lambda x, y: x + y).collect()
    

    Output:

    [('A', [3, 9, 12]), ('B', [4, 10, 11])]
    
    1. Define a combiner function that sets the accumulator to the first key-value pair encountered within a partition, converting the value to a list in this step.

    2. Define a function that merges a new value of the same key into the accumulator from step 1. Note: wrap the new value in a list before concatenating, since the accumulator was converted to a list in the first step.

    3. Define a function to merge the combiner outputs of the individual partitions (see the sketch after this list).
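
    For clarity, here are the same three arguments written as named functions instead of lambdas, to make the mapping to the steps explicit. This is a sketch with hypothetical function names; it assumes combineddatardd from above:

    def to_list(v):    # step 1: first value seen for a key in a partition -> singleton list
        return [v]

    def merge_value(acc, v):    # step 2: merge a new value of the same key into the partition's list
        acc.append(v)
        return acc

    def merge_lists(acc1, acc2):    # step 3: merge the lists built by different partitions
        acc1.extend(acc2)
        return acc1

    combineddatardd.combineByKey(to_list, merge_value, merge_lists).collect()
    # [('A', [3, 9, 12]), ('B', [4, 10, 11])]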

  • 2020-11-27 15:20

    You can use the RDD groupByKey method. Note that groupByKey returns an iterable per key, not a list, so map the grouped values through list to get the output shown below:

    Input:

    data = [(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
    rdd = sc.parallelize(data)
    result = rdd.groupByKey().mapValues(list).collect()
    

    Output:

    [(1, ['a', 'b']), (2, ['c', 'd', 'e']), (3, ['f'])]
    
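    For reference, without the mapValues(list) step each collected value is a pyspark ResultIterable rather than a list:

    # Each value below is an iterable, not a list:
    rdd.groupByKey().collect()
    # [(1, <pyspark.resultiterable.ResultIterable object at ...>), ...]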