I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).
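The requested transformation can be sketched in plain Python as a minimal simulation of what Spark's `groupByKey` does on one machine (the names `pairs` and `grouped` are illustrative, not from the original post):

```python
from collections import defaultdict

# Group (K, V) pairs into (K, [V1, V2, ...]) — the same grouping that
# Spark's groupByKey performs, shown here without a cluster.
pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

result = dict(grouped)
# result == {"a": [1, 3, 4], "b": [2, 5]}
```

In PySpark the equivalent is `rdd.groupByKey().mapValues(list)`, since `groupByKey` yields an iterable, not a list, per key.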
tl;dr If you really require an operation like this, use `groupByKey` as suggested by @MariusIon. Every other solution proposed here is either bluntly inefficient or at least suboptimal compared to direct grouping.

`reduceByKey` with list concatenation is not an acceptable solution because:

- Each application of `+` to a pair of lists requires a full copy of both lists (O(N)), effectively increasing the overall complexity to O(N²).
- It doesn't solve any of the problems of `groupByKey`: the amount of data that has to be shuffled, as well as the size of the final structure, is the same.
- There is no difference in the level of parallelism between implementations using `reduceByKey` and `groupByKey`.
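The quadratic cost of list concatenation can be demonstrated without Spark by folding n singleton lists with `+`, the way `reduceByKey(lambda a, b: a + b)` would combine the values for a single key (the counting wrapper `concat` is illustrative):

```python
from functools import reduce

# Each `+` on a pair of lists allocates a new list and copies every
# element of both operands. Count the copied elements while folding
# n singleton lists sequentially.
copies = 0

def concat(a, b):
    global copies
    copies += len(a) + len(b)  # elements copied into the freshly allocated list
    return a + b

n = 1000
singletons = [[i] for i in range(n)]
merged = reduce(concat, singletons)

# A left fold copies 2 + 3 + ... + n elements in total:
# n*(n+1)//2 - 1, i.e. O(n^2) work to build a list of n elements.
```

By contrast, n calls to `list.append` do O(n) total work, which is why repeated concatenation is the wrong tool here.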
`combineByKey` with `list.extend` is a suboptimal solution because:

- It creates O(N) `list` objects in `mergeValue` (this could be optimized by using `list.append` directly on the new item).
- If used with `list.append`, it is exactly equivalent to the old (Spark <= 1.3) implementation of `groupByKey` and ignores all the optimizations introduced by SPARK-3074, which enables external (on-disk) grouping of larger-than-memory structures.
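`combineByKey`'s three-function contract can be simulated in plain Python to show why `append`-based merging is exactly a hand-rolled `groupByKey`. The function names mirror the real PySpark parameters (`createCombiner`, `mergeValue`, `mergeCombiners`); the two-phase driver loop and the `partitions` data are illustrative assumptions:

```python
def create_combiner(v):
    return [v]              # first value for a key starts a fresh list

def merge_value(acc, v):
    acc.append(v)           # list.append: O(1), no new list object
    return acc

def merge_combiners(a, b):
    a.extend(b)             # merge per-partition lists after the shuffle
    return a

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4), ("a", 5)]]

# Phase 1: combine values within each partition (map side).
per_partition = []
for part in partitions:
    acc = {}
    for k, v in part:
        acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
    per_partition.append(acc)

# Phase 2: merge partition-level combiners (reduce side).
combined = {}
for acc in per_partition:
    for k, c in acc.items():
        combined[k] = merge_combiners(combined[k], c) if k in combined else c

# combined == {"a": [1, 3, 5], "b": [2, 4]}
```

Note that the result materializes every per-key list in memory, which is precisely what the post-SPARK-3074 `groupByKey` avoids by spilling to disk.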