Count on Spark Dataframe is extremely slow


I'm creating a new DataFrame with a handful of records from a Join.

val joined_df = first_df.join(second_df, first_df.col("key") === second_df.col("key"))


2 Answers
  •  星月不相逢
    2020-12-30 10:27

    Everything is fast (under one second) except the count operation.

    This is expected: every operation before the count is a transformation, and this type of Spark operation is lazy, i.e. Spark does no computation until an action (count in your example) is called.
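
    A minimal sketch of that laziness, assuming an existing SparkSession and the first_df / second_df from the question:

        import org.apache.spark.sql.DataFrame

        // Transformation: returns almost instantly because Spark only
        // records a logical plan here -- no data is read or shuffled yet.
        val joined_df: DataFrame =
          first_df.join(second_df, first_df.col("key") === second_df.col("key"))

        // Action: only now does Spark read, shuffle and join the data,
        // so the entire cost of the plan is billed to this single call.
        val n: Long = joined_df.count()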

    The second problem is the repartition(1):

    keep in mind that you'll lose all the parallelism offered by Spark: your computation will run on a single executor (a single core if you are in standalone mode). You should either remove this step or change 1 to a number proportional to your number of CPU cores (standalone mode) or to the number of executors (cluster mode), as in the sketch below.
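
    A hedged sketch of that fix, assuming the SparkSession is in scope as spark and that its defaultParallelism is a reasonable target:

        // Size the partition count to the cluster instead of collapsing to 1.
        // defaultParallelism is typically the total number of executor cores.
        val numPartitions = spark.sparkContext.defaultParallelism
        val repartitioned = joined_df.repartition(numPartitions)

        // If repartition(1) was only meant to produce a single output file,
        // defer the collapse to the final write: coalesce(1) avoids the full
        // shuffle that repartition(1) would trigger.
        joined_df.coalesce(1).write.parquet("/tmp/joined_output")  // illustrative path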

    The RDD conversion kicks in and literally takes hours to complete.

    If I understand correctly, you convert the DataFrame to an RDD. This is really bad practice in Spark and you should avoid such conversions as much as you can. Data in a DataFrame or Dataset is encoded with special Spark encoders (Tungsten, if I remember correctly) that take much less memory than JVM serialization; they also let Spark optimize many computations by working directly on the encoded data instead of serializing and deserializing it. Converting to an RDD means Spark turns all of your data from its own compact format back into plain JVM objects, which is why DataFrames and Datasets are much more powerful than RDDs; see the sketch below for the kind of rewrite that avoids the conversion.
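
    As an illustration of staying inside the encoded representation, a sketch assuming the RDD conversion was only needed to apply a per-row function (the column name and transformation are hypothetical):

        import org.apache.spark.sql.functions.{col, upper}

        // Avoid: .rdd deserializes every row out of Tungsten's compact
        // binary format into JVM objects before the map even runs.
        // val slow = joined_df.rdd.map(row => row.getString(0).toUpperCase).count()

        // Prefer: the same logic as a DataFrame expression, which operates
        // on the encoded data inside Catalyst's optimized plan.
        val fast = joined_df.select(upper(col("key"))).count()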

    Hope this helps.
