In the SparkSQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for difference. Obviously, a combination of union and except can be used to generate the difference, but it seems a bit awkward.
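For reference, the union-plus-except combination could look like the sketch below. This assumes Spark 1.6 in local mode; the DataFrames `df1`/`df2`, their column names, and their contents are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical local setup and example data (Spark 1.6 API)
val sc = new SparkContext(new SparkConf().setAppName("diff").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val df2 = Seq((2, "b"), (3, "c"), (4, "d")).toDF("id", "value")

// Symmetric difference: rows that appear in exactly one of the two DataFrames.
// Note: in the 1.6 DataFrame API the union operation is spelled unionAll.
val diff = df1.except(df2).unionAll(df2.except(df1))
```

Here `diff` should contain the rows `(1, "a")` and `(4, "d")`; note that `except` also deduplicates its result, which may or may not be what you want.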
I think it could be more efficient to use a left join and then filter out the rows where the joined columns are null.
df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left")
   .where(col("column_just_present_in_df2").isNull)
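A self-contained sketch of this left-join-plus-isNull pattern, again assuming Spark 1.6 in local mode with invented DataFrames and column names:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col

// Hypothetical local setup and example data (Spark 1.6 API)
val sc = new SparkContext(new SparkConf().setAppName("antijoin").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
val df2 = Seq((2, "x"), (3, "y")).toDF("id", "flag")

// Keep only df1 rows with no match in df2 (an anti-join),
// then drop df2's columns from the result.
// Caveat: "flag" must be a column that is never null in df2 itself,
// otherwise matched rows whose flag is null would wrongly survive the filter.
val onlyInDf1 = df1
  .join(df2, Seq("id"), "left")
  .where(col("flag").isNull)
  .select(df1.columns.map(col): _*)
```

Note that this computes the one-directional difference (rows of df1 absent from df2), which is what `except` does; for the symmetric difference you would have to apply it in both directions and union the results.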