Question
I tried taking the schema of one file as the common schema via df.schema and loading all the CSV files with it, but this fails because the headers of the other CSV files do not match the assigned schema.
Any suggestions, such as a function or Spark script, would be appreciated.
Answer 1:
As I understand it, you want to union/merge files with different schemas (though each is a subset of one master schema). I wrote this function, unionPro, which I think suits your requirement:
import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

def unionPro(DFList: List[DataFrame], spark: SparkSession): DataFrame = {
  /**
   * Accepts DataFrames with the same or different schemas / column orders,
   * with some or no common columns, and creates a single unioned DataFrame.
   */
  import spark.implicits._

  // Every distinct column name seen across all input DataFrames
  val MasterColList: Array[String] = DFList.map(_.columns).reduce((x, y) => x.union(y)).distinct

  // Select a DataFrame's own columns and fill the columns it lacks with nulls
  def unionExpr(myCols: Seq[String], allCols: Seq[String]): Seq[Column] = {
    allCols.toList.map {
      case x if myCols.contains(x) => col(x)
      case x => lit(null).as(x)
    }
  }

  // Build an empty DataFrame with the master schema, ignoring differing data types
  // in StructField and deduplicating fields by name, ignoring case
  val masterSchema = StructType(
    DFList.map(_.schema.fields).reduce((x, y) => x.union(y)).groupBy(_.name.toUpperCase).map(_._2.head).toArray)
  val masterEmptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], masterSchema)
    .select(MasterColList.head, MasterColList.tail: _*)

  // Align every DataFrame to the master column list and union them all
  DFList.map(df => df.select(unionExpr(df.columns, MasterColList): _*))
    .foldLeft(masterEmptyDF)((x, y) => x.union(y))
}
Here is a sample test for it:
val aDF = Seq(("A", 1), ("B", 2)).toDF("Name", "ID")
val bDF = Seq(("C", 1), ("D", 2)).toDF("Name", "Sal")
unionPro(List(aDF, bDF), spark).show
which gives this output:
+----+----+----+
|Name| ID| Sal|
+----+----+----+
| A| 1|null|
| B| 2|null|
| C|null| 1|
| D|null| 2|
+----+----+----+
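For the original CSV use case, a minimal sketch (the input paths below are hypothetical) could read each file with its own header and then pass the resulting DataFrames to unionPro:
// Hypothetical paths; each CSV has a header but possibly a different column subset
val paths = List("/data/part1.csv", "/data/part2.csv", "/data/part3.csv")
val dfs = paths.map(p => spark.read.option("header", "true").option("inferSchema", "true").csv(p))
val merged = unionPro(dfs, spark)
merged.show()
If you are on Spark 3.1 or later, the built-in unionByName with allowMissingColumns = true achieves a similar null-filling union, e.g. aDF.unionByName(bDF, allowMissingColumns = true).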
Source: https://stackoverflow.com/questions/60701190/spark-merge-union-dataframe-with-different-schema-column-names-and-sequence