Schema comparison of two DataFrames in Scala

悲哀的现实 2020-12-01 17:18

I am trying to write some test cases to validate the data between a source (.csv) file and a target (Hive table). One of the validations is a structure validation of the table.

5 answers
  •  时光取名叫无心
    2020-12-01 17:27

    Based on @Derek Kaknes's answer, here's the solution I came up with for comparing schemas. It considers only column name, data type, and nullability, and ignores metadata.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.{DataType, StructField}

    // Extract relevant information: name (key), type & nullability (values) of columns.
    // Column names are lowercased so the comparison is case-insensitive.
    def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = {
      df.schema.map { (structField: StructField) =>
        (structField.name.toLowerCase, (structField.dataType, structField.nullable))
      }.toMap
    }

    // Compare relevant information.
    // Both maps are expected to come from getCleanedSchema (lowercased keys).
    def getSchemaDifference(schema1: Map[String, (DataType, Boolean)],
                            schema2: Map[String, (DataType, Boolean)]
                           ): Map[String, (Option[(DataType, Boolean)], Option[(DataType, Boolean)])] = {
      (schema1.keys ++ schema2.keys).
        toList.distinct.
        flatMap { (columnName: String) =>
          val schema1FieldOpt: Option[(DataType, Boolean)] = schema1.get(columnName)
          val schema2FieldOpt: Option[(DataType, Boolean)] = schema2.get(columnName)

          // Keep only the columns whose (type, nullability) differ or that exist on one side only
          if (schema1FieldOpt == schema2FieldOpt) None
          else Some(columnName -> ((schema1FieldOpt, schema2FieldOpt)))
        }.toMap
    }
    
    • The getCleanedSchema method extracts the information of interest (column data type and nullability) and returns a map from lowercased column name to a (DataType, Boolean) tuple.

    • The getSchemaDifference method returns a map containing only those columns that differ between the two schemas. If a column is absent from one of the two schemas, its corresponding entry is None.
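To make the shape of the result concrete, here is a minimal sketch of the same idea applied to two hand-built schemas. It is an illustration under assumptions, not part of the original answer: the object name `SchemaDiffExample`, the helpers `cleanedSchema` and `schemaDifference`, and both sample schemas are hypothetical. The helpers take a `StructType` instead of a `DataFrame` so that no SparkSession is needed, and the only requirement is the `spark-sql` library on the classpath.

```scala
import org.apache.spark.sql.types._

object SchemaDiffExample {
  // (data type, nullable) for one column
  type ColumnInfo = (DataType, Boolean)

  // Hypothetical variant of getCleanedSchema that takes a StructType directly
  def cleanedSchema(schema: StructType): Map[String, ColumnInfo] =
    schema.fields.map(f => (f.name.toLowerCase, (f.dataType, f.nullable))).toMap

  // Same comparison logic as getSchemaDifference: keep only columns that differ
  def schemaDifference(
      s1: Map[String, ColumnInfo],
      s2: Map[String, ColumnInfo]
  ): Map[String, (Option[ColumnInfo], Option[ColumnInfo])] =
    (s1.keySet ++ s2.keySet).toSeq.flatMap { name =>
      val left = s1.get(name)
      val right = s2.get(name)
      if (left == right) None else Some((name, (left, right)))
    }.toMap

  // Hypothetical source schema (e.g. what a CSV read might produce)
  val sourceSchema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("Name", StringType, nullable = true),
    StructField("amount", DoubleType, nullable = true)
  ))

  // Hypothetical target schema (e.g. what a Hive table might expose)
  val targetSchema: StructType = StructType(Seq(
    StructField("ID", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true),
    StructField("amount", StringType, nullable = true),
    StructField("load_date", DateType, nullable = true)
  ))

  val diff: Map[String, (Option[ColumnInfo], Option[ColumnInfo])] =
    schemaDifference(cleanedSchema(sourceSchema), cleanedSchema(targetSchema))

  def main(args: Array[String]): Unit =
    // One line per differing column: name -> (source side, target side)
    diff.foreach(println)
}
```

With these sample schemas, `id` and `name` match after lowercasing, so only `amount` (type mismatch) and `load_date` (missing from the source) appear in the result. In a test case, asserting that the returned map is empty is one way to express "the structures match", and on failure the map itself says which columns differ.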
