Compare two DataFrames in PySpark


臣服心动 2021-02-04 22:28

I'm trying to compare two data frames that have the same number of columns, i.e. 4 columns, with id as the key column in both data frames.

df1 = spark.read.csv("/path/to/


      
      
        
4 answers

不要未来只要你来 (OP) 2021-02-04 22:54

You can have that query built for you, in both PySpark and Scala, by the spark-extension package.
It provides a diff transformation that does exactly that:
from gresearch.spark.diff import *

options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|diff|    changes| id|left_name|right_name|left_sal|right_sal|left_Address|right_Address|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|   N|         []|  1|      ABC|       ABC|    5000|     5000|          US|           US|
|   C|  [Address]|  2|      DEF|       DEF|    4000|     4000|          UK|          CAN|
|   C|      [sal]|  3|      GHI|       GHI|    3000|     3500|         JPN|          JPN|
|   C|[name, sal]|  4|      JKL|     JKL_M|    4500|     4800|         CHN|          CHN|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+


While this is a simple example, diffing DataFrames can become complicated when wide schemas, insertions, deletions and null values are involved. That package is well-tested, so you don't have to worry about getting that query right yourself.
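
To make the row-level semantics of the table above concrete, here is a minimal plain-Python sketch of a keyed diff. It is not how spark-extension is implemented; it only mirrors the N (no change) and C (change) markers shown in the output, plus I (insert) and D (delete) for keys that exist on one side only. The function name diff_rows and the choice of None as the change list for inserted and deleted rows are assumptions made for illustration. The sample rows are read back from the left_* and right_* columns of the output above.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def diff_rows(left: List[Row], right: List[Row], key: str) -> List[Tuple[str, Optional[List[str]], Any]]:
    """Hypothetical helper: return (marker, changed_columns, key_value) per key.

    Markers: N = no change, C = change, I = only in right, D = only in left.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    result = []
    # Union of keys from both sides, the in-memory analogue of a full outer join.
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        l, r = left_by_key.get(k), right_by_key.get(k)
        if l is None:
            result.append(("I", None, k))
        elif r is None:
            result.append(("D", None, k))
        else:
            # None == None is True in Python, so nulls on both sides count as equal.
            changed = [c for c in l if c != key and l[c] != r.get(c)]
            result.append(("C" if changed else "N", changed, k))
    return result


# Rows reconstructed from the left_* / right_* columns of the output table.
left = [
    {"id": 1, "name": "ABC", "sal": 5000, "Address": "US"},
    {"id": 2, "name": "DEF", "sal": 4000, "Address": "UK"},
    {"id": 3, "name": "GHI", "sal": 3000, "Address": "JPN"},
    {"id": 4, "name": "JKL", "sal": 4500, "Address": "CHN"},
]
right = [
    {"id": 1, "name": "ABC", "sal": 5000, "Address": "US"},
    {"id": 2, "name": "DEF", "sal": 4000, "Address": "CAN"},
    {"id": 3, "name": "GHI", "sal": 3500, "Address": "JPN"},
    {"id": 4, "name": "JKL_M", "sal": 4800, "Address": "CHN"},
]

for marker, changed, k in diff_rows(left, right, "id"):
    print(marker, changed, k)
# N [] 1
# C ['Address'] 2
# C ['sal'] 3
# C ['name', 'sal'] 4
```

On real DataFrames the same idea is usually expressed as a full outer join on the key with null-safe comparisons per column, which is where wide schemas and nulls make a hand-written query easy to get wrong.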
    
             
                                                        
            
            
              
                