How to update a PySpark DataFrame with new values from another DataFrame?

迷失自我  2020-12-19 16:04

I have two Spark DataFrames:

DataFrame A:

|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |

and DataFrame B:

|col_1 | col_2 | col_3 |
|val_1 | val_2 | val_3 |
3 Answers
  •  粉色の甜心
    2020-12-19 16:48

    This is closely related to update a dataframe column with new values, except that you also want to add the rows from DataFrame B. One approach would be to first do what is outlined in the linked question and then union the result with DataFrame B and drop duplicates.

    For example:

    import pyspark.sql.functions as f

    dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
        .select(
            'col_1',
            f.when(
                ~f.isnull(f.col('b.col_2')),
                f.col('b.col_2')
            ).otherwise(f.col('a.col_2')).alias('col_2'),
            'b.col_3'
        )\
        .union(dfB)\
        .dropDuplicates()\
        .sort('col_1')\
        .show()
    #+-----+-----+-----+
    #|col_1|col_2|col_3|
    #+-----+-----+-----+
    #|    a|  wew|    1|
    #|    b|  eee| null|
    #|    c|  rer|    3|
    #|    d|  yyy|    2|
    #+-----+-----+-----+
    

    Or, more generically, use a list comprehension if you have many columns to replace and don't want to hard-code them all:

    cols_to_update = ['col_2']
    
    dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
        .select(
            'col_1',
            *[
                f.when(
                    ~f.isnull(f.col('b.{}'.format(c))),
                    f.col('b.{}'.format(c))
                ).otherwise(f.col('a.{}'.format(c))).alias(c)
                for c in cols_to_update
            ],
            'b.col_3'
        )\
        .union(dfB)\
        .dropDuplicates()\
        .sort('col_1')\
        .show()
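
    Note that cols_to_update itself need not be hard-coded either: DataFrame.columns is a plain Python list of names, so the update set can be derived by intersecting the two schemas. A small sketch — the lists below are hypothetical, standing in for dfA.columns and dfB.columns:

```python
# DataFrame.columns is just a list of strings; plain lists stand in for it here
dfA_columns = ['col_1', 'col_2']           # hypothetical schema of dfA
dfB_columns = ['col_1', 'col_2', 'col_3']  # hypothetical schema of dfB
join_keys = ['col_1']

# every non-key column present in both frames gets the when/otherwise treatment
cols_to_update = [c for c in dfB_columns if c in dfA_columns and c not in join_keys]
```

    Columns that exist only in B (here col_3) are not update targets; they are carried through in the select as in the example above.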
    
