What is going wrong with `unionAll` of Spark `DataFrame`?

前端未结

关注

 5  626

误落风尘 2020-11-29 07:19

Using Spark 1.5.0 and given the following code, I expect unionAll to union DataFrames based on their column name. In the code, I\'m using some FunSuite for pass

5条回答

离开以前 (楼主)

2020-11-29 08:06
It doesn't look like a bug at all. What you see is a standard SQL behavior and every major RDMBS, including PostgreSQL, MySQL, Oracle and MS SQL behaves exactly the same. You'll find SQL Fiddle examples linked with names.

To quote PostgreSQL manual:

In order to calculate the union, intersection, or difference of two queries, the two queries must be "union compatible", which means that they return the same number of columns and the corresponding columns have compatible data types

Column names, excluding the first table in the set operation, are simply ignored.

This behavior comes directly form the Relational Algebra where basic building block is a tuple. Since tuples are ordered an union of two sets of tuples is equivalent (ignoring duplicates handling) to the output you get here.

If you want to match using names you can do something like this
```
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
  val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
  a.select(columns: _*).unionAll(b.select(columns: _*))
}
```
To check both names and types it is should be enough to replace columns with:
```
a.dtypes.toSet.intersect(b.dtypes.toSet).map{case (c, _) => col(c)}.toSeq
```
0 讨论(0)

查看其它5个回答
发布评论:

提交评论
- 加载中...