Spark DataFrame 1:
+------+-------+---------+----+---+-------+
|city |product|date |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|cit
A scalable and easy way is to diff the two DataFrames with spark-extension:
import uk.co.gresearch.spark.diff._
df1.diff(df2, "city", "product", "date").show
+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
|diff| city|product| date|left_sale|right_sale|left_exp|right_exp|left_wastage|right_wastage|
+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
| N|city 1|prod 2 |2017-08-25| 50| 50| 687| 687| 201| 201|
| C|city 1|prod 3 |2017-09-09| 236| 230| 431| 430| 169| 160|
| I|city 3|prod 4 |2017-09-18| null| 230| null| 431| null| 169|
| N|city 3|prod 3 |2017-09-08| 236| 236| 431| 431| 169| 169|
| D|city 2|prod 1 |2017-09-28| 358| null| 975| null| 193| null|
| I|city 1|prod 4 |2017-09-27| null| 350| null| 90| null| 190|
| N|city 1|prod 1 |2017-09-29| 358| 358| 975| 975| 193| 193|
| N|city 2|prod 2 |2017-08-24| 50| 50| 687| 687| 201| 201|
+----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
The diff column marks each row as I (inserted: present only in df2), C (changed: same key, different values), D (deleted: present only in df1) or N (unchanged).
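If you only care about the rows that differ, you can filter on the diff column. Below is a minimal, self-contained sketch of that idea. It assumes the spark-extension artifact is already on the classpath and that Spark runs locally; the sample rows and the names DiffExample and result are made up for illustration, they are not your real data.

```scala
import org.apache.spark.sql.SparkSession
import uk.co.gresearch.spark.diff._

object DiffExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption: no cluster needed)
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("diff-example")
      .getOrCreate()
    import spark.implicits._

    val columns = Seq("city", "product", "date", "sale", "exp", "wastage")

    // Hypothetical sample rows, loosely modeled on the output above
    val df1 = Seq(
      ("city 1", "prod 1", "2017-09-29", 358, 975, 193),
      ("city 1", "prod 3", "2017-09-09", 236, 431, 169),
      ("city 2", "prod 1", "2017-09-28", 358, 975, 193)
    ).toDF(columns: _*)

    val df2 = Seq(
      ("city 1", "prod 1", "2017-09-29", 358, 975, 193),
      ("city 1", "prod 3", "2017-09-09", 230, 430, 160),
      ("city 3", "prod 4", "2017-09-18", 230, 431, 169)
    ).toDF(columns: _*)

    // Same call as above: city, product and date identify a row
    val result = df1.diff(df2, "city", "product", "date")

    // Keep only rows that were inserted, changed or deleted
    result.where($"diff" =!= "N").show()

    // Count how many rows fall into each category
    result.groupBy("diff").count().show()

    spark.stop()
  }
}
```

Since the result is an ordinary DataFrame, the usual where, groupBy and write operations apply to it, so you can persist only the differences instead of printing them.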