I am trying to write some test cases to validate the data between the source (.csv) file and the target (Hive table). One of the validations is the structure validation of the table.
I've had this issue before and it was caused by differences in the StructField.metadata attribute. It is almost impossible to identify this out of the box, as a simple inspection of the StructFields will only show the name, datatype and nullable values. My suggestion for debugging would be to compare the metadata of your fields, with something like this:
res39.schema.zip(targetRawData.schema).foreach { case (r: StructField, t: StructField) =>
  println(s"Field: ${r.name}\n--|    res_meta: ${r.metadata}\n--| target_meta: ${t.metadata}")
}
If you want to compare schemas but ignore metadata, then I don't have a great solution. The best that I've been able to come up with is to iterate over the StructFields, manually remove the metadata, and then create a temporary copy of the dataframe without it. So you can do something like this (assuming that df is the dataframe you want to strip of metadata):
import org.apache.spark.sql.types.{StructField, StructType}

val schemaWithoutMetadata = StructType(df.schema.map { f =>
  StructField(f.name, f.dataType, f.nullable)
})
val tmpDF = spark.createDataFrame(df.rdd, schemaWithoutMetadata)
Then you can either compare the dataframes directly or compare the schemas the way you have been attempting. I assume this solution is not performant, so it should only be used on small datasets.
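If all you need is a boolean check, the same idea can be wrapped into a small helper. A minimal sketch, where the helper name withoutMetadata and the dataframes df1/df2 are my own placeholders:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

// Rebuild the schema keeping only name, type and nullability.
def withoutMetadata(df: DataFrame): StructType =
  StructType(df.schema.map(f => StructField(f.name, f.dataType, f.nullable)))

// Case classes compare structurally, so == now ignores metadata.
val schemasMatch = withoutMetadata(df1) == withoutMetadata(df2)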
I've just had the exact same problem. When you read data from Hive, the schema's StructField components will sometimes contain Hive metadata in their metadata field. You can't see it when printing the schemas, because this field is not part of the toString definition.
Here is the solution I've decided to use: I just get a copy of the schema with empty Metadata before comparing it:
import org.apache.spark.sql.types.Metadata

schema.map(_.copy(metadata = Metadata.empty))
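Applied to a comparison, that looks like the following (schema1 and schema2 are hypothetical stand-ins for your two schemas):
schema1.map(_.copy(metadata = Metadata.empty)) == schema2.map(_.copy(metadata = Metadata.empty)) // order-sensitive structural equality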
This is a Java-level object comparison problem; you should try .equals(). This mostly works, unless different source types introduce metadata or nullability differences.
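For instance (df1 and df2 being hypothetical dataframes to compare):
df1.schema.equals(df2.schema) // in Scala, the same check as df1.schema == df2.schema
Note that this still compares metadata and column order, hence the caveat above.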
Based on @Derek Kaknes's answer, here's the solution I came up with for comparing schemas, being concerned only with column name, datatype & nullability, and indifferent to metadata:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructField}

// Extract relevant information: name (key), type & nullability (values) of columns
def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = {
  df.schema.map { (structField: StructField) =>
    structField.name.toLowerCase -> (structField.dataType, structField.nullable)
  }.toMap
}

// Compare relevant information
def getSchemaDifference(schema1: Map[String, (DataType, Boolean)],
                        schema2: Map[String, (DataType, Boolean)]
                       ): Map[String, (Option[(DataType, Boolean)], Option[(DataType, Boolean)])] = {
  (schema1.keys ++ schema2.keys).
    map(_.toLowerCase).
    toList.distinct.
    flatMap { (columnName: String) =>
      val schema1FieldOpt: Option[(DataType, Boolean)] = schema1.get(columnName)
      val schema2FieldOpt: Option[(DataType, Boolean)] = schema2.get(columnName)
      if (schema1FieldOpt == schema2FieldOpt) None
      else Some(columnName -> (schema1FieldOpt, schema2FieldOpt))
    }.toMap
}
The getCleanedSchema method extracts the information of interest (column datatype & nullability) and returns a map from column name to that tuple. The getSchemaDifference method returns a map containing only those columns that differ between the two schemas. If a column is absent from one of the two schemas, its corresponding properties will be None.
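Putting the two together, usage might look like this (sourceDF and targetDF are hypothetical stand-ins for the CSV data and the Hive table):
val diff = getSchemaDifference(getCleanedSchema(sourceDF), getCleanedSchema(targetDF))
if (diff.isEmpty) println("Schemas match")
else diff.foreach { case (col, (src, tgt)) => println(s"$col: source=$src, target=$tgt") }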
Option 1 - StructField.toString
Here is another solution, based on the observation that the string representation of name + DataType + nullable is unique for each column. As seen here, the toString implementation of StructField already follows that rule, therefore we can directly use it to compare the columns of different schemas:
import org.apache.spark.sql.types.{StructType, StructField}

val schemaDiff = (s1: StructType, s2: StructType) => {
  val s1Keys = s1.map(_.toString).toSet
  val s2Keys = s2.map(_.toString).toSet
  val commonKeys = s1Keys.intersect(s2Keys)
  val diffKeys = (s1Keys ++ s2Keys) -- commonKeys
  (s1 ++ s2).filter(sf => diffKeys.contains(sf.toString)).toList
}
Notice that field names are case-sensitive, hence different column names imply different columns.
The steps:
- Get the string representation, StructField($name,$dataType,$nullable), for every column of s1 and s2.
- Find the keys that are common to both schemas.
- Take the keys that appear in only one of the two schemas (diffKeys).
- Return the StructFields whose string representation is among the diffKeys.
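A quick usage sketch (res39 and targetRawData as in the question): an empty result means the schemas agree, and since toString omits metadata, metadata differences are ignored here.
schemaDiff(res39.schema, targetRawData.schema).foreach(println)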
Option 2 - case class, equals, ==
StructField and StructType are both case classes, therefore we expect the equals method and the == operator (which delegates to equals) to compare them structurally, based on the values of their members. You can confirm that by applying the change that @cheseaux pointed out, for example:
import org.apache.spark.sql.types.{Metadata, StructType}

val s1 = StructType(res39.schema.map(_.copy(metadata = Metadata.empty)))
val s2 = StructType(targetRawData.schema.map(_.copy(metadata = Metadata.empty)))
s1 == s2 // true
This is expected, since == can be applied between two lists of case classes and returns true only if both lists contain identical items. In the previous case the == operator was applied between two StructType objects, and consequently between two Seq[StructField] objects, as we can see in the constructor definition. As already discussed, the comparison in your case was failing because the value of metadata differed between the schemas.
Attention: the == operator is not safe between schemas if we modify the order of the columns, because the list implementation of == considers the order of the items as well. To overcome that obstacle we can safely convert the collection into a set with toSet, as shown above.
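For instance, a plain equality check that ignores both metadata and column order could look like this (a sketch reusing the question's res39 and targetRawData):
val a = res39.schema.map(_.copy(metadata = Metadata.empty)).toSet
val b = targetRawData.schema.map(_.copy(metadata = Metadata.empty)).toSet
a == b // true even if the columns appear in a different order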
Finally, we can take advantage of the above observations and rewrite the first version into the next one:
val schemaDiff = (s1: StructType, s2: StructType) => {
  val s1Set = s1.map(_.copy(metadata = Metadata.empty)).toSet
  val s2Set = s2.map(_.copy(metadata = Metadata.empty)).toSet
  val commonItems = s1Set.intersect(s2Set)
  ((s1Set ++ s2Set) -- commonItems).toList
}
The performance drawback of the second option is that we need to recreate every StructField item by setting metadata = Metadata.empty.