Compare two dataframes Pyspark

臣服心动 2021-02-04 22:28

I'm trying to compare two data frames that have the same number of columns (4), with id as the key column in both data frames.

df1 = spark.read.csv("/path/to/


        
4 Answers
  •  不要未来只要你来
    2021-02-04 22:54

    You can have that query built for you in PySpark and Scala by the spark-extension package. It provides the diff transformation, which does exactly that.

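    from gresearch.spark.diff import *
    
    # Add a 'changes' column listing the names of the columns that differ.
    options = DiffOptions().with_change_column('changes')
    # Diff df1 against df2, matching rows on the 'id' key column.
    df1.diff_with_options(df2, options, 'id').show()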
    +----+-----------+---+---------+----------+--------+---------+------------+-------------+
    |diff|    changes| id|left_name|right_name|left_sal|right_sal|left_Address|right_Address|
    +----+-----------+---+---------+----------+--------+---------+------------+-------------+
    |   N|         []|  1|      ABC|       ABC|    5000|     5000|          US|           US|
    |   C|  [Address]|  2|      DEF|       DEF|    4000|     4000|          UK|          CAN|
    |   C|      [sal]|  3|      GHI|       GHI|    3000|     3500|         JPN|          JPN|
    |   C|[name, sal]|  4|      JKL|     JKL_M|    4500|     4800|         CHN|          CHN|
    +----+-----------+---+---------+----------+--------+---------+------------+-------------+
    
    

    While this is a simple example, diffing DataFrames can become complicated when wide schemas, insertions, deletions and null values are involved. That package is well-tested, so you don't have to worry about getting that query right yourself.
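
To make the row classification concrete, here is a minimal plain-Python sketch of what such a diff computes for each key. It is not the package's implementation: the `diff_rows` helper and the `Row` alias are hypothetical names introduced for illustration, and the sample rows are copied from the output table above. It also shows where insertions (`I`), deletions (`D`) and nulls come into play.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical row type for this sketch: column name -> value (None models a null).
Row = Dict[str, Optional[object]]


def diff_rows(left: List[Row], right: List[Row], key: str) -> List[Tuple[str, List[str], object]]:
    """Classify each key as N (no change), C (changed), I (inserted) or D (deleted)."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    result = []
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        l, r = left_by_key.get(k), right_by_key.get(k)
        if l is None:
            result.append(("I", [], k))  # key exists only on the right side
        elif r is None:
            result.append(("D", [], k))  # key exists only on the left side
        else:
            # Python treats None == None as equal; in Spark SQL a plain '='
            # on two nulls yields null, so a null-safe comparison is needed.
            changed = [c for c in l if c != key and l[c] != r.get(c)]
            result.append(("C" if changed else "N", changed, k))
    return result


# Sample data taken from the output table above.
left = [
    {"id": 1, "name": "ABC", "sal": 5000, "Address": "US"},
    {"id": 2, "name": "DEF", "sal": 4000, "Address": "UK"},
    {"id": 3, "name": "GHI", "sal": 3000, "Address": "JPN"},
    {"id": 4, "name": "JKL", "sal": 4500, "Address": "CHN"},
]
right = [
    {"id": 1, "name": "ABC", "sal": 5000, "Address": "US"},
    {"id": 2, "name": "DEF", "sal": 4000, "Address": "CAN"},
    {"id": 3, "name": "GHI", "sal": 3500, "Address": "JPN"},
    {"id": 4, "name": "JKL_M", "sal": 4800, "Address": "CHN"},
]

for diff, changes, row_id in diff_rows(left, right, "id"):
    print(diff, changes, row_id)
```

In Spark the same idea is typically expressed as a full outer join on the key column with null-safe comparisons per column, which is the query the package builds for you.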
