In the Spark SQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for difference. Obviously, a combination of union and except can be used to generate the difference.
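For example, something like the following (my own illustration, using unionAll since union only replaced it in 2.0):

val difference = df1.except(df2).unionAll(df2.except(df1))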
If you are looking for a PySpark solution, you should use subtract() (see the docs).
Also, unionAll is deprecated as of 2.0; use union() instead.
df1.union(df2).subtract(df1.intersect(df2))
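In Scala, where DataFrame has no subtract, the same shape works with except. A minimal self-contained sketch, assuming a local SparkSession and toy one-column frames:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("diff-example").getOrCreate()
import spark.implicits._

val df1 = Seq(1, 2, 3).toDF("id")
val df2 = Seq(3, 4, 5).toDF("id")

// union keeps every row from both sides; except then removes the shared rows,
// leaving the symmetric difference: 1, 2, 4, 5
df1.union(df2).except(df1.intersect(df2)).show()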
Notice that EXCEPT (or MINUS, which is just an alias for EXCEPT) de-dups results. So if you expect the "except" set (the diff you mentioned) plus the "intersect" set to add back up to the original dataframe, consider this feature request for a variant that keeps duplicates:
https://issues.apache.org/jira/browse/SPARK-21274
As I wrote there, "EXCEPT ALL" can be rewritten in Spark SQL as:

SELECT a, b, c
FROM tab1 t1
LEFT OUTER JOIN tab2 t2
  ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
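To run that query from Scala, roughly this (a sketch assuming df1/df2 carry columns a, b, c; the view names match the query above):

// register the frames under the names used in the SQL
df1.createOrReplaceTempView("tab1")
df2.createOrReplaceTempView("tab2")

val exceptAll = spark.sql("""
  SELECT a, b, c
  FROM tab1 t1
  LEFT OUTER JOIN tab2 t2
    ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
  WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
""")

For what it's worth, SPARK-21274 was eventually resolved: native EXCEPT ALL and INTERSECT ALL landed in Spark 2.4.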
Why not the below?
df1.except(df2)
You can always rewrite it as:
df1.unionAll(df2).except(df1.intersect(df2))
Seriously though, UNION, INTERSECT, and EXCEPT / MINUS are pretty much the standard set of SQL combining operators. I am not aware of any system which provides an XOR-like operation out of the box, most likely because it is trivial to implement using the other three and there is not much to optimize there.
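If you need it often, a one-off helper covers it; a sketch, with symmetricDifference being just an illustrative name:

import org.apache.spark.sql.DataFrame

// XOR-like set operation composed from the standard operators; de-dups like EXCEPT does
def symmetricDifference(a: DataFrame, b: DataFrame): DataFrame =
  a.union(b).except(a.intersect(b))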
I think it could be more efficient to use a left join and then filter out the nulls.
import org.apache.spark.sql.functions.col

df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left")
  .where(col("column_just_present_in_df2").isNull)
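The key and column names above are placeholders. Note that this keeps only the rows of df1 with no match in df2, i.e. a one-directional difference rather than the symmetric one. Spark's built-in anti join (available since 2.0) expresses the same filter directly, without the null check:

// equivalent one-directional difference via the built-in anti join
df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left_anti")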