Schema comparison of two DataFrames in Scala

悲哀的现实 2020-12-01 17:18

I am trying to write some test cases to validate the data between a source (.csv file) and a target (Hive table). One of the validations is a structure (schema) validation of the table.

5 answers
  •  陌清茗 (OP)
    2020-12-01 17:42

    Option 1 - StructField.toString

    Here is another solution, based on the observation that the string representation of name + DataType + nullable is unique for each column. The toString implementation of StructField already follows that rule, so we can use it directly to compare the columns of different schemas:

    import org.apache.spark.sql.types.{StructType, StructField}

    // Returns the fields that appear in only one of the two schemas.
    val schemaDiff = (s1: StructType, s2: StructType) => {
      val s1Keys = s1.map(_.toString).toSet
      val s2Keys = s2.map(_.toString).toSet
      val commonKeys = s1Keys.intersect(s2Keys)

      val diffKeys = s1Keys ++ s2Keys -- commonKeys

      (s1 ++ s2).filter(sf => diffKeys.contains(sf.toString)).toList
    }
    

    Note that field names are compared case-sensitively, so two names that differ only in case are treated as different columns.

    The steps:

    1. For each schema generate a set of keys where each key has the format StructField($name,$dataType,$nullable)
    2. Get intersection of keys
    3. Subtract intersection from union of keys, that will give us the keys difference (diffKeys)
    4. Finally, from both schemas keep only the fields whose string representation exists in diffKeys
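
    As a quick sanity check, the steps above can be exercised on two small hand-built schemas. This is only a sketch: the column names (id, name, age) and the source/target values are made up for illustration, while the function body is the one from this answer.

```scala
import org.apache.spark.sql.types._

// Same function as above: fields whose string form appears in only one schema.
val schemaDiff = (s1: StructType, s2: StructType) => {
  val s1Keys = s1.map(_.toString).toSet
  val s2Keys = s2.map(_.toString).toSet
  val commonKeys = s1Keys.intersect(s2Keys)
  val diffKeys = s1Keys ++ s2Keys -- commonKeys
  (s1 ++ s2).filter(sf => diffKeys.contains(sf.toString)).toList
}

// Hypothetical schemas: "age" differs in type, "name" exists only in source.
val source = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))
val target = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("age", LongType, nullable = true)
))

val diff = schemaDiff(source, target)
diff.foreach(println)
```

    With these inputs the result holds three fields: name and the IntegerType version of age from source, plus the LongType version of age from target. The id column matches on both sides and is left out.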

    Option 2 - case class, equals, ==

    StructField and StructType are both case classes, therefore we expect the equals method and the == operator to compare the values of their members (structural equality) rather than object references; eq, by contrast, checks reference identity only. You can confirm that by applying the change that @cheseaux pointed out, for example:

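    import org.apache.spark.sql.types.{Metadata, StructType}

    // res39 and targetRawData are the two DataFrames from the question.
    val s1 = StructType(res39.schema.map(_.copy(metadata = Metadata.empty)))
    val s2 = StructType(targetRawData.schema.map(_.copy(metadata = Metadata.empty)))

    s1 == s2 // true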
    

    This is expected, since == between two sequences of case class instances returns true only if both contain equal items in the same order. Here == was applied between two StructType objects and consequently between their StructField members, as we can see in the constructor definition. As already discussed, the comparison in your case failed because the metadata values differed between the schemas.
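
    The effect of metadata on == can be reproduced in isolation. The following is a minimal sketch, assuming a single made-up column and a made-up "comment" metadata entry; stripMetadata is a hypothetical helper name, not part of Spark.

```scala
import org.apache.spark.sql.types._

// Hypothetical metadata attached to one side only.
val meta = new MetadataBuilder().putString("comment", "loaded from csv").build()

val withMeta = StructType(Seq(
  StructField("id", IntegerType, nullable = true, metadata = meta)))
val noMeta = StructType(Seq(
  StructField("id", IntegerType, nullable = true)))

// Metadata is a member of StructField, so it takes part in ==.
val equalBefore = withMeta == noMeta

// Hypothetical helper: reset metadata on every field before comparing.
def stripMetadata(s: StructType): StructType =
  StructType(s.map(_.copy(metadata = Metadata.empty)))

val equalAfter = stripMetadata(withMeta) == stripMetadata(noMeta)
```

    Here equalBefore is false and equalAfter is true, which matches the behaviour described above.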

    Note that the == operator is not safe between schemas if the order of the columns differs, because sequence equality also takes the order of the items into account. To overcome that obstacle we can convert each collection into a set with toSet, as shown above.
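
    A small sketch of the ordering issue, using two made-up fields (idField and nameField are illustrative names):

```scala
import org.apache.spark.sql.types._

val idField   = StructField("id", IntegerType, nullable = false)
val nameField = StructField("name", StringType, nullable = true)

// Same columns, different order.
val ab = StructType(Seq(idField, nameField))
val ba = StructType(Seq(nameField, idField))

val sameAsSeq = ab == ba             // order-sensitive comparison
val sameAsSet = ab.toSet == ba.toSet // order-insensitive comparison
```

    sameAsSeq is false while sameAsSet is true. Keep in mind that a set also discards duplicates, which should not matter as long as column names are unique within a schema.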

    Finally, we can take advantage of the above observations and rewrite the first version as follows:

    import org.apache.spark.sql.types.{Metadata, StructType}

    // Compares the fields themselves (with metadata ignored) instead of their strings.
    val schemaDiff = (s1: StructType, s2: StructType) => {
      val s1Set = s1.map(_.copy(metadata = Metadata.empty)).toSet
      val s2Set = s2.map(_.copy(metadata = Metadata.empty)).toSet
      val commonItems = s1Set.intersect(s2Set)

      (s1Set ++ s2Set -- commonItems).toList
    }
    

    The performance drawback of the second option is that we need to recreate every StructField with metadata = Metadata.empty.
