What is going wrong with `unionAll` of Spark `DataFrame`?

前端 未结 5 626
误落风尘
误落风尘 2020-11-29 07:19

Using Spark 1.5.0 and given the following code, I expect unionAll to union DataFrames based on their column name. In the code, I\'m using some FunSuite for pass

5条回答
  •  离开以前
    2020-11-29 08:06

    It doesn't look like a bug at all. What you see is a standard SQL behavior and every major RDMBS, including PostgreSQL, MySQL, Oracle and MS SQL behaves exactly the same. You'll find SQL Fiddle examples linked with names.

    To quote PostgreSQL manual:

    In order to calculate the union, intersection, or difference of two queries, the two queries must be "union compatible", which means that they return the same number of columns and the corresponding columns have compatible data types

    Column names, excluding the first table in the set operation, are simply ignored.

    This behavior comes directly form the Relational Algebra where basic building block is a tuple. Since tuples are ordered an union of two sets of tuples is equivalent (ignoring duplicates handling) to the output you get here.

    If you want to match using names you can do something like this

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    
    def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
      val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
      a.select(columns: _*).unionAll(b.select(columns: _*))
    }
    

    To check both names and types it is should be enough to replace columns with:

    a.dtypes.toSet.intersect(b.dtypes.toSet).map{case (c, _) => col(c)}.toSeq
    

提交回复
热议问题