Spark dataframe: collect() vs select()

情话喂你 2020-12-13 06:33
Calling collect() on an RDD returns the entire dataset to the driver, which can cause an out-of-memory error, so we should avoid it.
Will collect()

      
      
        
6 answers

伪装坚强ぢ (original poster) 2020-12-13 07:19
              

            
            
                        
Short answer:


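collect is mainly for serializing the data, that is, gathering it into a single local collection on the driver
(parallelism is lost, while all other characteristics of the dataframe's data are preserved).
For example, with a PrintWriter pw you can't call df.foreach(r => pw.println(r)) directly, because the closure runs on the executors and a PrintWriter is not serializable; you must collect before foreach, as in df.collect.foreach(r => pw.println(r)).
PS: the "loss of parallelism" is not a total loss, because the collected data can be distributed to the executors again afterwards.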
select is mainly for choosing columns, similar to projection in relational algebra
(only similar within the framework's context, because Spark's select does not deduplicate rows).
So, within the framework, it also complements filter: select picks columns, filter picks rows.
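
The two points above can be sketched in Scala roughly as follows. This is only an illustration under assumptions: the local SparkSession settings, the sample rows, the column names and the output file name rows.txt are all hypothetical and not part of the original answer.

```scala
import java.io.PrintWriter
import org.apache.spark.sql.SparkSession

object CollectVsSelectSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (app name and master are assumptions).
    val spark = SparkSession.builder()
      .appName("collect-vs-select")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; note the duplicated second row.
    val df = Seq((1, "a", 10.0), (2, "b", 20.0), (2, "b", 20.0))
      .toDF("id", "label", "score")

    // select is a transformation: it yields a new DataFrame with fewer columns.
    // Duplicate rows are kept, unlike a relational-algebra projection.
    val projected = df.select("id", "label")

    // collect is an action: it returns the rows to the driver as Array[Row],
    // so the driver-side PrintWriter is used in an ordinary local loop.
    val pw = new PrintWriter("rows.txt") // hypothetical output path
    try {
      projected.collect().foreach(row => pw.println(row.mkString(",")))
    } finally {
      pw.close()
    }

    spark.stop()
  }
}
```

Because everything returned by collect has to fit in driver memory, a sketch like this is only reasonable when the selected data is known to be small.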




A comment on the other answers: I like Jeff's classification of Spark operations into transformations (such as select) and actions (such as collect). It is also worth remembering that transformations (including select) are lazily evaluated.
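
To make the transformation/action distinction concrete, here is a minimal sketch, again assuming a local SparkSession and made-up data and column names. Nothing is computed until an action is called.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object LazySelectSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (app name and master are assumptions).
    val spark = SparkSession.builder()
      .appName("lazy-select")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

    // Transformations: these only build a query plan, no job is run yet.
    val plan = df.filter(col("id") > 1).select("label")
    plan.explain() // prints the physical plan without computing any rows

    // Actions: each of these triggers execution of the plan.
    val n = plan.count()      // only a single number reaches the driver
    val rows = plan.collect() // every resulting row reaches the driver
    println(s"count=$n, collected=${rows.length}")

    spark.stop()
  }
}
```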
    
             
                                                        
            
            
              
                