I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn)
into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).
If you want to do a reduceByKey where the type in the reduced KV pairs is different from the type in the original KV pairs, then you can use the function combineByKey. What the function does is take KV pairs and combine them (by key) into KC pairs, where C is a different type than V.
One specifies three functions: createCombiner, mergeValue, and mergeCombiners. The first specifies how to transform a type V into a type C, the second describes how to combine a type C with a type V, and the last specifies how to combine a type C with another type C. My code creates the K-V pairs (a stand-in sketch is shown below).
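The pair-creation code itself is not included here; as a minimal stand-in (hypothetical keys and tuple values, built with pyspark's parallelize), it might look like this:

from pyspark import SparkContext

sc = SparkContext("local", "combineByKey-example")

# Hypothetical K-V pairs: keys repeat and each value is a tuple.
My_KV = sc.parallelize([("a", (1, 2)), ("a", (3, 4)), ("b", (5, 6))])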
Define the 3 functions as follows:
def Combiner(a):  # Turns value a (a tuple) into a list of a single tuple.
    return [a]

def MergeValue(a, b):  # a is the new type [(,), (,), ..., (,)] and b is the old type (,)
    a.extend([b])
    return a

def MergeCombiners(a, b):  # a is the new type [(,), ..., (,)] and so is b; combine them.
    a.extend(b)
    return a
Then, My_KMV = My_KV.combineByKey(Combiner, MergeValue, MergeCombiners)
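Using the hypothetical My_KV sketched above, collecting the result would give something like this (the exact keys and values are assumptions, and ordering may vary across partitions):

My_KMV = My_KV.combineByKey(Combiner, MergeValue, MergeCombiners)

print(My_KMV.collect())
# Roughly: [('a', [(1, 2), (3, 4)]), ('b', [(5, 6)])]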
The best resource I found on using this function is: http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/
As others have pointed out, both a.append(b) and a.extend(b) return None. So reduceByKey(lambda a, b: a.append(b)) returns None on the first pair of KV pairs, then fails on the second pair because None.append(b) fails. You could work around this by defining a separate function:
def My_Extend(a, b):  # Concatenate list b onto list a and return the result.
    a.extend(b)
    return a
Then call reduceByKey(lambda a, b: My_Extend(a, b)). (The use of the lambda function here may be unnecessary, but I have not tested this case.)
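Note that for My_Extend's a.extend(b) to work inside reduceByKey, both arguments must already be lists, so each value needs to be wrapped in a one-element list first (for example with mapValues). A minimal sketch, again using the hypothetical My_KV from above:

# Wrap each tuple value in a one-element list, then concatenate the lists by key.
My_KMV2 = My_KV.mapValues(lambda v: [v]).reduceByKey(My_Extend)

print(My_KMV2.collect())
# Roughly: [('a', [(1, 2), (3, 4)]), ('b', [(5, 6)])]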