Compare two Spark dataframes

再見小時候 · 2020-12-13 15:38

Spark dataframe 1 -:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|cit         


        
5 Answers
  •  轻奢々 (OP)
    2020-12-13 16:35

    Below is the utility function I used to compare two dataframes against the following criteria:

    1. Column length
    2. Record count
    3. Column by column comparing for all records

    Criterion 3 is implemented by hashing the concatenation of all columns in each record.

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.{col, concat_ws, md5}
    import spark.implicits._  // for the $"..." column syntax; assumes an active SparkSession named spark

    def verifyMatchAndSaveSignatureDifferences(oldDF: DataFrame, newDF: DataFrame, pkColumn: String): Long = {
      assert(oldDF.columns.length == newDF.columns.length, s"column lengths don't match")
      assert(oldDF.count == newDF.count, s"record counts don't match")

      // Hash the concatenation of all columns in a record
      def createHashColumn(df: DataFrame): Column = {
        val colArr = df.columns
        md5(concat_ws("", colArr.map(col(_)): _*))
      }

      val newSigDF = newDF.select(col(pkColumn), createHashColumn(newDF).as("signature_new"))
      val oldSigDF = oldDF.select(col(pkColumn), createHashColumn(oldDF).as("signature"))

      // Join on the primary-key column and keep only records whose hashes differ;
      // drop the duplicate key column so the result can be written out
      val joinDF = newSigDF.join(oldSigDF, newSigDF(pkColumn) === oldSigDF(pkColumn))
        .where($"signature" =!= $"signature_new")
        .drop(oldSigDF(pkColumn))
        .cache

      val diff = joinDF.count
      // write out any records that don't match
      if (diff > 0)
        joinDF.write.saveAsTable("signature_table")

      joinDF.unpersist()

      diff
    }
    

    If the method returns 0, the two dataframes are identical in every respect; otherwise a table named signature_table in Hive's default schema will contain all records that differ between them.
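
    For reference, here is a minimal usage sketch. The table names sales_old and sales_new and the key column "city" are hypothetical; it assumes an active SparkSession named spark and that the function above is in scope.

    val oldDF = spark.table("sales_old")  // hypothetical baseline table
    val newDF = spark.table("sales_new")  // hypothetical new table with the same schema

    val diffCount = verifyMatchAndSaveSignatureDifferences(oldDF, newDF, "city")
    if (diffCount == 0)
      println("dataframes match")
    else
      println(s"$diffCount records differ; see signature_table")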

    Hope this helps.
