I am trying to write test cases to validate the data between a source (.csv) file and a target (Hive table). One of the validations is the structure validation of the table.
Based on @Derek Kaknes's answer, here is the solution I came up with for comparing schemas, concerned only with column name, datatype and nullability, and indifferent to metadata.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructField}

// Extract the relevant information: column name (key), datatype & nullability (value)
def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = {
  df.schema.map { (structField: StructField) =>
    structField.name.toLowerCase -> (structField.dataType, structField.nullable)
  }.toMap
}

// Compare the relevant information and keep only the columns that differ
def getSchemaDifference(schema1: Map[String, (DataType, Boolean)],
                        schema2: Map[String, (DataType, Boolean)]
                       ): Map[String, (Option[(DataType, Boolean)], Option[(DataType, Boolean)])] = {
  (schema1.keys ++ schema2.keys)
    .map(_.toLowerCase)
    .toList.distinct
    .flatMap { (columnName: String) =>
      val schema1FieldOpt: Option[(DataType, Boolean)] = schema1.get(columnName)
      val schema2FieldOpt: Option[(DataType, Boolean)] = schema2.get(columnName)
      if (schema1FieldOpt == schema2FieldOpt) None
      else Some(columnName -> (schema1FieldOpt, schema2FieldOpt))
    }.toMap
}
The getCleanedSchema method extracts the information of interest - column datatype & nullability - and returns a map from lower-cased column name to a (DataType, Boolean) tuple.
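For illustration, a minimal sketch of what getCleanedSchema produces; the SparkSession setup and the sample columns (ID, Name) are assumptions added here, not part of the original answer:

import org.apache.spark.sql.SparkSession

// Assumed local setup for the sketch
val spark = SparkSession.builder().master("local[*]").appName("schema-check").getOrCreate()
import spark.implicits._

// Hypothetical sample DataFrame with two columns
val sampleDf = Seq((1, "alice"), (2, "bob")).toDF("ID", "Name")

// Keys are lower-cased column names; values are (DataType, nullable),
// e.g. Map("id" -> (IntegerType, false), "name" -> (StringType, true))
println(getCleanedSchema(sampleDf))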
The getSchemaDifference method returns a map containing only those columns that differ between the two schemas. If a column is absent from one of the two schemas, its corresponding properties will be None.
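Putting the two methods together, a hedged usage sketch for the csv-vs-Hive check; the file path and table name below are placeholders, not values from the original question:

// Read the source file and the target Hive table (placeholder path and table name)
val csvDf  = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/source.csv")
val hiveDf = spark.table("target_db.target_table")

// Compare the cleaned schemas; an empty result means name, datatype and nullability all match
val diff = getSchemaDifference(getCleanedSchema(csvDf), getCleanedSchema(hiveDf))
assert(diff.isEmpty, s"Schema mismatch found: $diff")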