In the Spark SQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for difference. Obviously, a combination of union and except can be used to generate the difference.
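For example, something like the following (my own illustration, using unionAll since union only replaced it in 2.0):

val difference = df1.except(df2).unionAll(df2.except(df1))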
If you are looking for a PySpark solution, you should use subtract() (see the docs).
Also, unionAll is deprecated as of 2.0; use union() instead.
df1.union(df2).subtract(df1.intersect(df2))
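In Scala, where DataFrame has no subtract, the same shape works with except. A minimal self-contained sketch, assuming a local SparkSession and toy one-column frames:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("diff-example").getOrCreate()
import spark.implicits._

val df1 = Seq(1, 2, 3).toDF("id")
val df2 = Seq(3, 4, 5).toDF("id")

// union keeps every row from both sides; except then removes the shared rows,
// leaving the symmetric difference: 1, 2, 4, 5
df1.union(df2).except(df1.intersect(df2)).show()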
Notice that EXCEPT (or MINUS, which is just an alias for EXCEPT) de-dups results. So if you expect the "except" set (the diff you mentioned) plus the "intersect" set to add back up to the original dataframe, consider this feature request for a variant that keeps duplicates:
https://issues.apache.org/jira/browse/SPARK-21274
As I wrote there, "EXCEPT ALL" can be rewritten in Spark SQL as:

SELECT a, b, c
FROM tab1 t1
LEFT OUTER JOIN tab2 t2
  ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
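To run that query from Scala, roughly this (a sketch assuming df1/df2 carry columns a, b, c; the view names match the query above):

// register the frames under the names used in the SQL
df1.createOrReplaceTempView("tab1")
df2.createOrReplaceTempView("tab2")

val exceptAll = spark.sql("""
  SELECT a, b, c
  FROM tab1 t1
  LEFT OUTER JOIN tab2 t2
    ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
  WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
""")

For what it's worth, SPARK-21274 was eventually resolved: native EXCEPT ALL and INTERSECT ALL landed in Spark 2.4.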
Why not the below?
df1.except(df2)
You can always rewrite it as:
df1.unionAll(df2).except(df1.intersect(df2))
Seriously though, UNION, INTERSECT, and EXCEPT / MINUS are pretty much the standard set of SQL combining operators. I am not aware of any system which provides an XOR-like operation out of the box, most likely because it is trivial to implement using the other three and there is not much to optimize there.
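If you need it often, a one-off helper covers it; a sketch, with symmetricDifference being just an illustrative name:

import org.apache.spark.sql.DataFrame

// XOR-like set operation composed from the standard operators; de-dups like EXCEPT does
def symmetricDifference(a: DataFrame, b: DataFrame): DataFrame =
  a.union(b).except(a.intersect(b))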
I think it could be more efficient to use a left join and then filter out the nulls.
import org.apache.spark.sql.functions.col

df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left")
  .where(col("column_just_present_in_df2").isNull)
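The key and column names above are placeholders. Note that this keeps only the rows of df1 with no match in df2, i.e. a one-directional difference rather than the symmetric one. Spark's built-in anti join (available since 2.0) expresses the same filter directly, without the null check:

// equivalent one-directional difference via the built-in anti join
df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left_anti")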