how to compare two data frames in scala

问题

I have two exactly same dataframes for comparison test

     df1
     ------------------------------------------
     year | state | count2 | count3 | count4|
     2014 | NJ    | 12332  | 54322  | 53422 |
     2014 | NJ    | 12332  | 53255  | 55324 |
     2015 | CO    | 12332  | 53255  | 55324 |
     2015 | MD    | 14463  | 76543  | 66433 |
     2016 | CT    | 14463  | 76543  | 66433 |
     2016 | CT    | 55325  | 76543  | 66433 |
     ------------------------------------------
     df2
     ------------------------------------------
     year | state | count2 | count3 | count4|
     2014 | NJ    | 12332  | 54322  | 53422 |
     2014 | NJ    | 65333  | 65555  | 125   |
     2015 | CO    | 12332  | 53255  | 55324 |
     2015 | MD    | 533    | 75     | 64524 |
     2016 | CT    | 14463  | 76543  | 66433 |
     2016 | CT    | 55325  | 76543  | 66433 |
     ------------------------------------------

I want to compare with these two dfs on count2 to count4, if the counts doesn't match then print out some message saying it is mismatching. here is my try

     val cols = df1.columns.filter(_ != "year").toList
     def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null).otherwise(array($"l.$name", $"r.$name")).as(name)
     val result = df1.as("l").join(df2.as("r"), "year").select($"year" :: cols.map(mapDiffs): _*)

it then compares with the same state with the same number, it didn't do what I wanted to do

     ------------------------------------------
     year | state | count2 | count3 | count4|
     2014 | NJ    | 12332  | 54322  | 53422 |
     2014 | NJ    | no     | no     | no    |
     2015 | CO    | 12332  | 53255  | 55324 |
     2015 | MD    | no     | no     | 64524 |
     2016 | CT    | 14463  | 76543  | 66433 |
     2016 | CT    | 55325  | 76543  | 66433 |
     ------------------------------------------

I want the result to come out as above, how do I achieve that?

edits, also in a different scenario if I want to compare only in one df, col to cols how do I do that? like

 ------------------------------------------
 year | state | count2 | count3 | count4|
 2014 | NJ    | 12332  | 54322  | 53422 |

I want to compare count3 and count 4 cols to count2, obviously count3 and count 4 do not match count 2, so I want the result to be

-----------------------------------------------
 year | state | count2 | count3    | count4   |
 2014 | NJ    | 12332  | mismatch  | mismatch |

Thank you!

回答1:

The dataframe join on year won't work for your mapDiffs method. You need a row-identifying column in df1 and df2 for the join.

import org.apache.spark.sql.functions._

val df1 = Seq(
  ("2014", "NJ", "12332", "54322", "53422"),
  ("2014", "NJ", "12332", "53255", "55324"),
  ("2015", "CO", "12332", "53255", "55324"),
  ("2015", "MD", "14463", "76543", "64524"),
  ("2016", "CT", "14463", "76543", "66433"),
  ("2016", "CT", "55325", "76543", "66433")
).toDF("year", "state", "count2", "count3", "count4")

val df2 = Seq(
  ("2014", "NJ", "12332", "54322", "53422"),
  ("2014", "NJ", "12332", "53255", "125"),
  ("2015", "CO", "12332", "53255", "55324"),
  ("2015", "MD", "533",   "75",    "64524"),
  ("2016", "CT", "14463", "76543", "66433"),
  ("2016", "CT", "55325", "76543", "66433")
).toDF("year", "state", "count2", "count3", "count4")

Skip this if you already have a row-identifying column (say, rowId) in the dataframes for thejoin:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd1 = df1.rdd.zipWithIndex.map{
  case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df1i = spark.createDataFrame( rdd1,
  StructType(df1.schema.fields :+ StructField("rowId", LongType, false))
)

val rdd2 = df2.rdd.zipWithIndex.map{
  case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df2i = spark.createDataFrame( rdd2,
  StructType(df2.schema.fields :+ StructField("rowId", LongType, false))
)

Now, define mapDiffs and apply it to the selected columns after joining the dataframes by rowId:

def mapDiffs(name: String) =
  when($"l.$name" === $"r.$name", $"l.$name").otherwise("no").as(name)

val cols = df1i.columns.filter(_.startsWith("count")).toList

val result = df1i.as("l").join(df2i.as("r"), "rowId").
  select($"l.rowId" :: $"l.year" :: cols.map(mapDiffs): _*)

// +-----+----+------+------+------+
// |rowId|year|count2|count3|count4|
// +-----+----+------+------+------+
// |    0|2014| 12332| 54322| 53422|
// |    5|2016| 55325| 76543| 66433|
// |    1|2014| 12332| 53255|    no|
// |    3|2015|    no|    no| 64524|
// |    2|2015| 12332| 53255| 55324|
// |    4|2016| 14463| 76543| 66433|
// +-----+----+------+------+------+

Note that there appears to be more discrepancies between df1 and df2 than just the 3 no-spots in your sample result. I've modified the sample data to make those 3 spots the only difference.

来源：https://stackoverflow.com/questions/51093862/how-to-compare-two-data-frames-in-scala

标签

scala

amazon-web-services

apache-spark

Hadoop

bigdata