Compare two Spark dataframes

Asked by 再見小時候 on 2020-12-13 15:38

Spark dataframe 1:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|cit

5 Answers
  • 2020-12-13 16:13

    Check out MegaSparkDiff, an open-source project on GitHub that helps compare dataframes. The project is not yet published to Maven Central, but you can look at the SparkCompare Scala class, which compares two dataframes.

    The code snippet below gives you two dataframes: one holds the rows inLeftButNotInRight, the other the rows inRightButNotInLeft.

    If you do a JOIN between the two, you can apply some logic to identify the missing primary keys (where possible); those keys would then constitute the deleted records (see the sketch after the code).

    We are working on adding the use case you are looking for to the project. https://github.com/FINRAOS/MegaSparkDiff

    https://github.com/FINRAOS/MegaSparkDiff/blob/master/src/main/scala/org/finra/msd/sparkcompare/SparkCompare.scala

    import org.apache.commons.lang3.tuple.{ImmutablePair, Pair}
    import org.apache.spark.sql.DataFrame

    private def compareSchemaDataFrames(left: DataFrame, leftViewName: String,
                                        right: DataFrame, rightViewName: String): Pair[DataFrame, DataFrame] = {
      // make sure that column names match in both dataframes
      if (!left.columns.sameElements(right.columns)) {
        println("column names were different")
        throw new Exception("Column Names Did Not Match")
      }

      val leftCols = left.columns.mkString(",")
      val rightCols = right.columns.mkString(",")

      // group by all columns in both dataframes so duplicate rows are
      // counted rather than collapsed (both temp views share one SQLContext)
      val groupedLeft = left.sqlContext.sql(
        s"select $leftCols, count(*) as recordRepeatCount from $leftViewName group by $leftCols")
      val groupedRight = left.sqlContext.sql(
        s"select $rightCols, count(*) as recordRepeatCount from $rightViewName group by $rightCols")

      // set difference in each direction
      val inLnotinR = groupedLeft.except(groupedRight).toDF()
      val inRnotinL = groupedRight.except(groupedLeft).toDF()

      new ImmutablePair[DataFrame, DataFrame](inLnotinR, inRnotinL)
    }
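
    With the returned pair, here is a minimal sketch of the JOIN idea above (my own illustration, not part of MegaSparkDiff, assuming city, product and date act as the primary key):

    import org.apache.spark.sql.functions.col

    val keys = Seq("city", "product", "date")

    // keys that appear in the left-only differences but not in the
    // right-only differences are candidates for deleted records
    val deletedKeys = inLnotinR
      .select(keys.map(col): _*)
      .join(inRnotinL.select(keys.map(col): _*), keys, "left_anti")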
    
  • 2020-12-13 16:14

    I am not sure about finding the deleted and modified records, but you can use the except function to get the difference:

    df2.except(df1)
    

    This returns the rows that have been added or modified in dataframe2, i.e. the records with changes. Output:

    +------+-------+---------+----+---+-------+
    |  city|product|     date|sale|exp|wastage|
    +------+-------+---------+----+---+-------+
    |city 3| prod 4|9/18/2017| 230|431|    169|
    |city 1| prod 4|9/27/2017| 350| 90|    190|
    |city 1| prod 3|9/9/2017 | 230|430|    160|
    +------+-------+---------+----+---+-------+
    

    You can also try a join and filter to get the changed and unchanged data:

    df1.join(df2, Seq("city","product", "date"), "left").show(false)
    df1.join(df2, Seq("city","product", "date"), "right").show(false)
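
    Building on the joins above, a minimal sketch of the filter step (my own illustration, assuming city, product and date form the key and df1/df2 are the old/new frames):

    import org.apache.spark.sql.functions.col

    val keys = Seq("city", "product", "date")
    val joined = df1.as("old").join(df2.as("new"), keys, "inner")

    // keys present in both frames whose measures differ
    val changed = joined.filter(
      col("old.sale") =!= col("new.sale") ||
      col("old.exp") =!= col("new.exp") ||
      col("old.wastage") =!= col("new.wastage"))

    // keys present in both frames with identical measures
    val unchanged = joined.filter(
      col("old.sale") === col("new.sale") &&
      col("old.exp") === col("new.exp") &&
      col("old.wastage") === col("new.wastage"))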
    

    Hope this helps!

  • 2020-12-13 16:28

    A scalable and easy way is to diff the two DataFrames with spark-extension:

    import uk.co.gresearch.spark.diff._
    
    df1.diff(df2, "city", "product", "date").show
    
    +----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
    |diff|  city|product|      date|left_sale|right_sale|left_exp|right_exp|left_wastage|right_wastage|
    +----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
    |   N|city 1|prod 2 |2017-08-25|       50|        50|     687|      687|         201|          201|
    |   C|city 1|prod 3 |2017-09-09|      236|       230|     431|      430|         169|          160|
    |   I|city 3|prod 4 |2017-09-18|     null|       230|    null|      431|        null|          169|
    |   N|city 3|prod 3 |2017-09-08|      236|       236|     431|      431|         169|          169|
    |   D|city 2|prod 1 |2017-09-28|      358|      null|     975|     null|         193|         null|
    |   I|city 1|prod 4 |2017-09-27|     null|       350|    null|       90|        null|          190|
    |   N|city 1|prod 1 |2017-09-29|      358|       358|     975|      975|         193|          193|
    |   N|city 2|prod 2 |2017-08-24|       50|        50|     687|      687|         201|          201|
    +----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
    

    It identifies Inserted (I), Changed (C), Deleted (D) and uN-changed (N) rows, as the diff column above shows.
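
    Since the question is also about deleted records, a small follow-up sketch (my own, using the diff codes shown in the output above) keeps only rows present in df1 but missing from df2:

    import org.apache.spark.sql.functions.col

    val deleted = df1.diff(df2, "city", "product", "date")
      .where(col("diff") === "D")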

  • 2020-12-13 16:33

    Using Spark's different join types seems to be the key to computing deletions, additions, and updates on rows.

    This question illustrates the different types of joins depending on what you are trying to achieve: What are the various join types in Spark?
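
    For example, a minimal sketch using a full outer join (my own illustration, assuming the question's city/product/date key and that sale is never null in an actual row):

    import org.apache.spark.sql.functions.{col, when}

    val keys = Seq("city", "product", "date")
    val full = df1.as("l").join(df2.as("r"), keys, "full_outer")

    val classified = full.withColumn("status",
      when(col("l.sale").isNull, "added")      // key only in df2
        .when(col("r.sale").isNull, "deleted") // key only in df1
        .when(col("l.sale") === col("r.sale") &&
              col("l.exp") === col("r.exp") &&
              col("l.wastage") === col("r.wastage"), "unchanged")
        .otherwise("updated"))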

  • 2020-12-13 16:35

    See below the utility function I used to compare two dataframes using the following criteria:

    1. Column length
    2. Record count
    3. Column-by-column comparison of all records

    Check three is done by hashing the concatenation of all columns in a record.

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.{col, concat_ws, md5}

    def verifyMatchAndSaveSignatureDifferences(oldDF: DataFrame, newDF: DataFrame, pkColumn: String): Long = {
      assert(oldDF.columns.length == newDF.columns.length, "column lengths don't match")
      assert(oldDF.count == newDF.count, "record counts don't match")

      // signature = md5 of all columns concatenated, so any changed value changes the hash
      def createHashColumn(df: DataFrame): Column = {
        val colArr = df.columns
        md5(concat_ws("", colArr.map(col(_)): _*))
      }

      val newSigDF = newDF.select(col(pkColumn), createHashColumn(newDF).as("signature_new"))
      val oldSigDF = oldDF.select(col(pkColumn), createHashColumn(oldDF).as("signature"))

      // join on the primary key and keep only rows whose signatures differ
      val joinDF = newSigDF.join(oldSigDF, newSigDF(pkColumn) === oldSigDF(pkColumn))
        .where(col("signature") =!= col("signature_new"))
        .cache

      val diff = joinDF.count
      // write out any records that don't match
      if (diff > 0)
        joinDF.write.saveAsTable("signature_table")

      joinDF.unpersist()

      diff
    }
    

    If the method returns 0, the two dataframes are exactly the same; otherwise, a table named signature_table in Hive's default schema will contain all records that differ between the two.
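
    For completeness, a hypothetical call site (the "id" primary-key column is my assumption; any unique key works):

    val mismatches = verifyMatchAndSaveSignatureDifferences(oldDF, newDF, "id")
    if (mismatches > 0)
      println(s"$mismatches records differ; see Hive table signature_table")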

    Hope this helps.
