How to update a pyspark dataframe with new values from another dataframe?

迷失自我 2020-12-19 16:04

I have two spark dataframes:

Dataframe A:

|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |

and dataframe B:

|col_1 | col_2 | col_3 |
|val_1 | val_2 | val_3 |

I want to update the values in dataframe A with the matching values from B (joined on col_1), and also append the rows of B that are not already in A.
3 Answers
  • 2020-12-19 16:48

    This is closely related to update a dataframe column with new values, except that you also want to add the rows from DataFrame B. One approach would be to first do what is outlined in the linked question and then union the result with DataFrame B and drop duplicates.

    For example:

    import pyspark.sql.functions as f

    dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
        .select(
            'col_1',
            f.when(
                ~f.isnull(f.col('b.col_2')),
                f.col('b.col_2')
            ).otherwise(f.col('a.col_2')).alias('col_2'),
            'b.col_3'
        )\
        .union(dfB)\
        .dropDuplicates()\
        .sort('col_1')\
        .show()
    #+-----+-----+-----+
    #|col_1|col_2|col_3|
    #+-----+-----+-----+
    #|    a|  wew|    1|
    #|    b|  eee| null|
    #|    c|  rer|    3|
    #|    d|  yyy|    2|
    #+-----+-----+-----+
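The join-then-select pattern above implements classic upsert semantics: where a key exists in both frames, B's value wins; otherwise A's value is kept; and B's extra rows are appended. A minimal plain-Python sketch of the same semantics (dataframes modeled as dicts keyed by col_1; all names and values are illustrative):

```python
# Plain-Python model of the upsert semantics: each dataframe is a
# dict keyed by col_1. All names and values here are illustrative.
dfA_rows = {'a': 'old_a', 'c': 'old_c'}   # existing values in A
dfB_rows = {'a': 'new_a', 'b': 'new_b'}   # updated / new values in B

# B's value wins on key collisions; B's new keys are appended --
# the same effect as the when/isnull select + union + dropDuplicates.
merged = {**dfA_rows, **dfB_rows}

print(merged)  # {'a': 'new_a', 'c': 'old_c', 'b': 'new_b'}
```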
    

    Or more generically using a list comprehension if you have a lot of columns to replace and you don't want to hard code them all:

    cols_to_update = ['col_2']
    
    dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
        .select(
            ['col_1'] +
            [
                f.when(
                    ~f.isnull(f.col('b.{}'.format(c))),
                    f.col('b.{}'.format(c))
                ).otherwise(f.col('a.{}'.format(c))).alias(c)
                for c in cols_to_update
            ] +
            ['b.col_3']
        )\
        .union(dfB)\
        .dropDuplicates()\
        .sort('col_1')\
        .show()
    
  • 2020-12-19 16:49

    If you want to keep only unique rows, and it does not matter which of two duplicate rows survives, then union followed by dropDuplicates should do the trick:

    columns_which_dont_change = [...]
    old_df.union(new_df).dropDuplicates(subset=columns_which_dont_change)
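A caveat with this approach: dropDuplicates with a subset keeps an arbitrary row per key, so nothing guarantees that new_df's version survives. A plain-Python sketch of why, modeling rows as (key, value) tuples (names are illustrative):

```python
# Model: deduplicate (key, value) rows on the key column only.
# Which duplicate survives depends on encounter order here -- in Spark
# it also depends on partitioning, so it is effectively arbitrary.
old_rows = [('a', 'old'), ('c', 'old')]
new_rows = [('a', 'new'), ('b', 'new')]

def dedup_on_key(rows):
    seen = {}
    for key, value in rows:
        seen.setdefault(key, value)  # first occurrence wins in this model
    return sorted(seen.items())

# The union order decides the outcome for the duplicated key 'a':
print(dedup_on_key(old_rows + new_rows))  # 'a' keeps its old value
print(dedup_on_key(new_rows + old_rows))  # 'a' gets the new value
```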
    
  • 2020-12-19 16:53

    I would opt for a different solution, which I believe is less verbose, more generic, and does not involve listing columns. I would first identify the subset of dfA that will be updated (replaceDf) by performing an inner join on keyCols (a list of key column names). Then I would subtract this replaceDf from dfA and union the result with dfB.

        replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
        resultDf = dfA.subtract(replaceDf).union(dfB)
        resultDf.show()
    

    Even if dfA and dfB have different columns, you can still overcome this by obtaining the list of columns from both DataFrames and taking their union. Then I would prepare the select query (instead of select('a.*')) so that it lists the columns of dfA that exist in dfB, plus "null as colname" for those that do not exist in dfB.
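That column-alignment step can be sketched with plain list manipulation; the column names below are hypothetical, and in PySpark the resulting strings could be passed to DataFrame.selectExpr to make dfB union-compatible with dfA:

```python
# Build a select list that aligns dfB with dfA's schema: columns
# missing from dfB become "null as <name>". Hypothetical column names.
dfA_columns = ['col_1', 'col_2', 'col_3', 'col_4']
dfB_columns = ['col_1', 'col_2', 'col_3']

select_exprs = [
    c if c in dfB_columns else 'null as {}'.format(c)
    for c in dfA_columns
]

print(select_exprs)  # ['col_1', 'col_2', 'col_3', 'null as col_4']
# In PySpark: dfB.selectExpr(*select_exprs) would then line up with dfA.
```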
