How to obtain the symmetric difference between two DataFrames?

借酒劲吻你 2020-12-02 19:18

In the Spark SQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for difference. Obviously, a combination of union and

5 answers
  •  孤城傲影
    2020-12-02 19:31

    Note that EXCEPT (or MINUS, which is just an alias for EXCEPT) de-duplicates its results. So if you expect the "except" set (the diff you mentioned) plus the "intersect" set to equal the original DataFrame, consider this feature request, which keeps duplicates:

    https://issues.apache.org/jira/browse/SPARK-21274

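    As I wrote there, "EXCEPT ALL" can be rewritten in Spark SQL as a left outer join that keeps only the rows of tab1 with no match in tab2:

    -- Qualify the selected columns: a, b and c exist in both tables.
    SELECT t1.a, t1.b, t1.c
    FROM    tab1 t1
         LEFT OUTER JOIN
            tab2 t2
         ON (
            (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
         )
    WHERE
        -- No match in tab2: every t2 column is NULL after the outer join.
        COALESCE(t2.a, t2.b, t2.c) IS NULL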
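
For the symmetric difference the question asks about, a minimal sketch using only `except` and `unionAll` might look like the following. It assumes both DataFrames have the same schema. The names `SymmetricDifference` and `symmetricDifference` are hypothetical helpers, not part of the Spark API. Because `except` de-duplicates, the result has set semantics: each row that appears in exactly one input is returned once.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper object; not part of Spark.
object SymmetricDifference {

  // Rows that appear in exactly one of the two DataFrames.
  // Assumes `left` and `right` share the same schema.
  // `except` removes duplicates, so each surviving row is returned once.
  def symmetricDifference(left: DataFrame, right: DataFrame): DataFrame = {
    val onlyInLeft  = left.except(right)   // rows of left missing from right
    val onlyInRight = right.except(left)   // rows of right missing from left
    onlyInLeft.unionAll(onlyInRight)       // the two parts are disjoint
  }
}
```

Called as `SymmetricDifference.symmetricDifference(df1, df2)`, this returns a new DataFrame and leaves both inputs unchanged. On Spark 2.0 and later, `union` does the same thing as `unionAll`.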
    
