Compare two Spark dataframes

再見小時候 2020-12-13 15:38

Spark DataFrame 1:

+------+-------+---------+----+---+-------+
|city  |product|date     |sale|exp|wastage|
+------+-------+---------+----+---+-------+
|cit         


        
5 Answers
  •  难免孤独
    2020-12-13 16:28

    A scalable and easy way is to diff the two DataFrames with spark-extension:

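    // Requires the spark-extension library on the classpath.
    import uk.co.gresearch.spark.diff._
    
    // "city", "product" and "date" are the key columns used to match rows;
    // all remaining columns are compared between df1 (left) and df2 (right).
    df1.diff(df2, "city", "product", "date").show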
    
    +----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
    |diff|  city|product|      date|left_sale|right_sale|left_exp|right_exp|left_wastage|right_wastage|
    +----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
    |   N|city 1|prod 2 |2017-08-25|       50|        50|     687|      687|         201|          201|
    |   C|city 1|prod 3 |2017-09-09|      236|       230|     431|      430|         169|          160|
    |   I|city 3|prod 4 |2017-09-18|     null|       230|    null|      431|        null|          169|
    |   N|city 3|prod 3 |2017-09-08|      236|       236|     431|      431|         169|          169|
    |   D|city 2|prod 1 |2017-09-28|      358|      null|     975|     null|         193|         null|
    |   I|city 1|prod 4 |2017-09-27|     null|       350|    null|       90|        null|          190|
    |   N|city 1|prod 1 |2017-09-29|      358|       358|     975|      975|         193|          193|
    |   N|city 2|prod 2 |2017-08-24|       50|        50|     687|      687|         201|          201|
    +----+------+-------+----------+---------+----------+--------+---------+------------+-------------+
    

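    The diff column marks each row as inserted (I), changed (C), deleted (D) or unchanged (N), where "inserted" means the key exists only in df2 and "deleted" means it exists only in df1.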
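
    If adding a dependency is not an option, a similar result can be approximated with a full outer join in plain Spark. The following is only a rough sketch, not how spark-extension works internally: simpleDiff, _in_left and _in_right are hypothetical names made up for this example, and it assumes both DataFrames share the same schema and have at least one non-key column.

    ```scala
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, lit, when}

    // Hypothetical helper (not part of spark-extension): classifies rows
    // by joining the two DataFrames on the given key columns.
    def simpleDiff(left: DataFrame, right: DataFrame, keys: Seq[String]): DataFrame = {
      // Every column that is not a key is treated as a value to compare.
      val values = left.columns.filterNot(c => keys.contains(c)).toSeq

      // Prefix value columns with their side and add a marker column so we
      // can tell after the join which side a row was present on.
      val l = values.foldLeft(left.withColumn("_in_left", lit(true))) {
        (df, c) => df.withColumnRenamed(c, s"left_$c")
      }
      val r = values.foldLeft(right.withColumn("_in_right", lit(true))) {
        (df, c) => df.withColumnRenamed(c, s"right_$c")
      }

      // Null-safe inequality: null on both sides counts as unchanged.
      val changed = values
        .map(c => !(col(s"left_$c") <=> col(s"right_$c")))
        .reduce(_ || _)

      l.join(r, keys, "full_outer")
        .withColumn("diff",
          when(col("_in_left").isNull, "I")     // key only in right
            .when(col("_in_right").isNull, "D") // key only in left
            .when(changed, "C")                 // key in both, values differ
            .otherwise("N"))                    // key in both, values equal
        .drop("_in_left", "_in_right")
    }

    // Usage with the DataFrames from the question:
    // simpleDiff(df1, df2, Seq("city", "product", "date")).show()
    ```

    The column order differs from the spark-extension output (here the diff column comes last), and the marker columns are needed because a null value column alone cannot distinguish a missing row from a row whose value is null.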
