Compare two Spark dataframes

再見小時候 2020-12-13 15:38

Spark dataframe 1:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|cit         


        
5 answers
  •  难免孤独
    2020-12-13 16:13

    Check out MegaSparkDiff, an open source project on GitHub that helps compare DataFrames. The project is not yet published on Maven Central, but you can look at the SparkCompare Scala class, which compares two DataFrames.

    The code snippet below gives you two DataFrames: one with the rows that are in the left DataFrame but not in the right (inLnotinR in the code), and another with the rows that are in the right but not in the left (inRnotinL).

    If you then JOIN the two, you can apply some logic to identify the missing primary keys (where possible); those keys constitute the deleted records.
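
    The JOIN step is not spelled out in the project snippet, so here is a minimal sketch of one way to do it, assuming the rows can be identified by a set of key columns. The object name KeyDiffSketch, the helper keysOnlyIn, the choice of city and product as the key, and the sample rows are all made up for illustration; only the "left_anti" join type is standard Spark.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object KeyDiffSketch {
      // Rows of `a` whose key (keyCols) has no match at all in `b`.
      // A left anti join keeps only the unmatched rows of the left side.
      def keysOnlyIn(a: DataFrame, b: DataFrame, keyCols: Seq[String]): DataFrame =
        a.join(b, keyCols, "left_anti")

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("key-diff-sketch")
          .getOrCreate()
        import spark.implicits._

        // Made-up rows standing in for the two difference DataFrames.
        val inLeftButNotInRight = Seq(
          ("city 1", "prod 1", 100),
          ("city 2", "prod 1", 50)
        ).toDF("city", "product", "sale")
        val inRightButNotInLeft = Seq(
          ("city 1", "prod 1", 120)
        ).toDF("city", "product", "sale")

        // Assumption: (city, product) acts as the primary key.
        val keyCols = Seq("city", "product")

        // Key present only on the left: candidate deleted records.
        val deleted = keysOnlyIn(inLeftButNotInRight, inRightButNotInLeft, keyCols)
        // Key present only on the right: candidate inserted records.
        val inserted = keysOnlyIn(inRightButNotInLeft, inLeftButNotInRight, keyCols)

        deleted.show()
        inserted.show()
        spark.stop()
      }
    }
    ```

    Keys that appear in both difference DataFrames belong to rows whose non-key values changed; an inner join on the same key columns would pick those out.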

    We are working on adding the use case you are looking for to the project: https://github.com/FINRAOS/MegaSparkDiff

    https://github.com/FINRAOS/MegaSparkDiff/blob/master/src/main/scala/org/finra/msd/sparkcompare/SparkCompare.scala

    // Assumption: Pair and ImmutablePair come from Apache Commons Lang 3
    import org.apache.commons.lang3.tuple.{ImmutablePair, Pair}
    import org.apache.spark.sql.DataFrame

    // Both DataFrames must already be registered as views named leftViewName and rightViewName
    private def compareSchemaDataFrames(left: DataFrame, leftViewName: String,
                                        right: DataFrame, rightViewName: String): Pair[DataFrame, DataFrame] = {
      // Make sure that the column names match in both DataFrames
      if (!left.columns.sameElements(right.columns)) {
        println("column names were different")
        throw new Exception("Column Names Did Not Match")
      }

      val leftCols = left.columns.mkString(",")
      val rightCols = right.columns.mkString(",")

      // Group by all columns in each DataFrame so that duplicate rows are counted
      val groupedLeft = left.sqlContext.sql("select " + leftCols + " , count(*) as recordRepeatCount from " + leftViewName + " group by " + leftCols)
      val groupedRight = left.sqlContext.sql("select " + rightCols + " , count(*) as recordRepeatCount from " + rightViewName + " group by " + rightCols)

      // Do the except/subtract: keep the rows of one side that have no identical row on the other
      val inLnotinR = groupedLeft.except(groupedRight).toDF()
      val inRnotinL = groupedRight.except(groupedLeft).toDF()

      new ImmutablePair[DataFrame, DataFrame](inLnotinR, inRnotinL)
    }
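
    If you would rather not register views or depend on MegaSparkDiff, the same group-then-except idea can be sketched with the DataFrame API alone. This is an illustration, not the project's code: the object name DiffSketch, the helpers countDuplicates and diff, and the sample rows are made up, and it assumes plain column names and no existing column called count.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col

    object DiffSketch {
      // Collapse identical rows and record how many times each one occurs.
      def countDuplicates(df: DataFrame): DataFrame =
        df.groupBy(df.columns.map(col): _*)
          .count()
          .withColumnRenamed("count", "recordRepeatCount")

      // Returns (rows only in left, rows only in right), duplicate counts included.
      def diff(left: DataFrame, right: DataFrame): (DataFrame, DataFrame) = {
        require(left.columns.sameElements(right.columns), "Column names did not match")
        val groupedLeft = countDuplicates(left)
        val groupedRight = countDuplicates(right)
        (groupedLeft.except(groupedRight), groupedRight.except(groupedLeft))
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("diff-sketch")
          .getOrCreate()
        import spark.implicits._

        // Made-up data: the left side holds one row twice, the right side once.
        val left = Seq(("city 1", 10), ("city 1", 10), ("city 2", 20)).toDF("city", "sale")
        val right = Seq(("city 1", 10), ("city 2", 20)).toDF("city", "sale")

        val (inLnotinR, inRnotinL) = diff(left, right)
        inLnotinR.show()
        inRnotinL.show()
        spark.stop()
      }
    }
    ```

    Because the repeat count is part of each grouped row, a row that is merely duplicated a different number of times on each side shows up in both results, which a plain except on the raw DataFrames would miss.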
    
