How to update a pyspark dataframe with new values from another dataframe?

迷失自我 · 2020-12-19 16:04

I have two spark dataframes:

Dataframe A:

|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |

and dataframe B:

         


        
3 Answers
  •  长情又很酷
    2020-12-19 16:53

    I would opt for a different solution, which I believe is less verbose, more generic, and does not involve listing columns. I would first identify the subset of dfA that will be updated (replaceDf) by performing an inner join on keyCols (a list of key column names). Then I would subtract this replaceDf from dfA and union the result with dfB.

        replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
        resultDf = dfA.subtract(replaceDf).union(dfB)
        resultDf.show()
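
    As a runnable sketch of the approach above (the sample rows, column names, and the `keyCols` value are assumptions for illustration, not from the question):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.master("local[1]").appName("update-demo").getOrCreate()

        # Hypothetical frames: dfB carries the new value for id=1.
        dfA = spark.createDataFrame([(1, "old"), (2, "keep")], ["id", "value"])
        dfB = spark.createDataFrame([(1, "new")], ["id", "value"])
        keyCols = ["id"]

        # Rows of dfA whose key also appears in dfB -- the rows to be replaced.
        replaceDf = dfA.alias("a").join(dfB.alias("b"), on=keyCols, how="inner").select("a.*")

        # Drop the stale rows from dfA, then append the fresh rows from dfB.
        resultDf = dfA.subtract(replaceDf).union(dfB)
        resultDf.show()

    With the sample data this keeps (2, "keep") from dfA and replaces (1, "old") with (1, "new") from dfB.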
    

    Even when dfA and dfB have different columns, you can still handle this by obtaining the list of columns from both DataFrames and taking their union. Then, instead of `select('a.*')`, prepare a select expression that lists the columns of dfA that exist in dfB, plus `null AS colname` for those that do not exist in dfB.
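
    A sketch of that schema alignment, padding dfB's missing columns with typed NULLs so the subsequent subtract/union works (the frames and the `extra` column are assumptions for illustration):

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.master("local[1]").appName("align-demo").getOrCreate()

        # Hypothetical frames with mismatched schemas: dfB lacks the "extra" column.
        dfA = spark.createDataFrame([(1, "old", "x"), (2, "keep", "y")], ["id", "value", "extra"])
        dfB = spark.createDataFrame([(1, "new")], ["id", "value"])

        # Select dfA's columns from dfB, substituting a NULL cast to dfA's column
        # type wherever dfB has no such column; keeps dfA's column order.
        dfB_aligned = dfB.select(
            *[
                F.col(c) if c in dfB.columns else F.lit(None).cast(dfA.schema[c].dataType).alias(c)
                for c in dfA.columns
            ]
        )
        dfB_aligned.show()

    Casting the NULL literal to dfA's column type (rather than leaving it as NullType) avoids type-resolution surprises when the aligned frame is later unioned with dfA.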
