Spark dataframe 1 has the columns city, product, date, sale, exp and wastage (the sample rows are truncated in the question).
Below is the utility function I used to compare two dataframes against the criteria above.
Task three is handled by taking an MD5 hash of the concatenation of all columns in each record.
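The hashing step can be sketched outside Spark too. This is a hypothetical standalone illustration (the object and method names are mine, not part of the answer's code): a record's signature is the MD5 of all its column values concatenated, mirroring `md5(concat_ws("", ...))` in the Spark version.

```scala
import java.security.MessageDigest

object RowSignature {
  // Hex-encoded MD5 of a string.
  def md5Hex(s: String): String =
    MessageDigest.getInstance("MD5")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString

  // Concatenate the column values with no separator, then hash —
  // the same idea as concat_ws("", cols: _*) followed by md5 in Spark.
  def signature(row: Seq[String]): String = md5Hex(row.mkString(""))
}
```

One design caveat: with an empty separator, `Seq("ab", "c")` and `Seq("a", "bc")` produce the same signature; using a separator character that cannot appear in the data avoids such collisions.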
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, concat_ws, md5}

def verifyMatchAndSaveSignatureDifferences(oldDF: DataFrame, newDF: DataFrame, pkColumn: String): Long = {
  assert(oldDF.columns.length == newDF.columns.length, s"column counts don't match")
  assert(oldDF.count == newDF.count, s"record counts don't match")

  // Signature of a record: md5 of all column values concatenated.
  def createHashColumn(df: DataFrame): Column =
    md5(concat_ws("", df.columns.map(col): _*))

  val newSigDF = newDF.select(col(pkColumn), createHashColumn(newDF).as("signature_new"))
  val oldSigDF = oldDF.select(col(pkColumn), createHashColumn(oldDF).as("signature"))

  // Join on the primary key and keep only records whose signatures differ.
  val joinDF = newSigDF
    .join(oldSigDF, newSigDF(pkColumn) === oldSigDF(pkColumn))
    .where(col("signature") =!= col("signature_new"))
    .cache()

  val diff = joinDF.count
  // Write out any records that don't match.
  if (diff > 0)
    joinDF.write.saveAsTable("signature_table")
  joinDF.unpersist()
  diff
}
If the method returns 0, the two dataframes are exactly the same in everything; otherwise, a table named signature_table in the default Hive schema will contain all records that differ between the two.
Hope this helps.