How to update a pyspark dataframe with new values from another dataframe?

迷失自我 · 2020-12-19 16:04

I have two spark dataframes:

Dataframe A:

|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |

and dataframe B:

         


        
3 Answers
  •  长情又很酷
    2020-12-19 16:53

    I would opt for a different solution, which I believe is less verbose, more generic, and does not involve listing columns. I would first identify the subset of dfA that will be updated (replaceDf) by performing an inner join on keyCols (a list of key column names). Then I would subtract this replaceDf from dfA and union the result with dfB.

        replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
        resultDf = dfA.subtract(replaceDf).union(dfB)
        resultDf.show()
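
    As a runnable sketch of the approach above (the sample rows, column names, and the `keyCols` value are assumptions for illustration, not from the question):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.master("local[1]").appName("update-demo").getOrCreate()

        # Hypothetical frames: dfB carries the new value for id=1.
        dfA = spark.createDataFrame([(1, "old"), (2, "keep")], ["id", "value"])
        dfB = spark.createDataFrame([(1, "new")], ["id", "value"])
        keyCols = ["id"]

        # Rows of dfA whose key also appears in dfB -- the rows to be replaced.
        replaceDf = dfA.alias("a").join(dfB.alias("b"), on=keyCols, how="inner").select("a.*")

        # Drop the stale rows from dfA, then append the fresh rows from dfB.
        resultDf = dfA.subtract(replaceDf).union(dfB)
        resultDf.show()

    With the sample data this keeps (2, "keep") from dfA and replaces (1, "old") with (1, "new") from dfB.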
    

    Even when dfA and dfB have different columns, you can still handle this by obtaining the list of columns from both DataFrames and taking their union. Then, instead of `select('a.*')`, prepare a select expression that lists the columns of dfA that exist in dfB, plus `null AS colname` for those that do not exist in dfB.
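
    A sketch of that schema alignment, padding dfB's missing columns with typed NULLs so the subsequent subtract/union works (the frames and the `extra` column are assumptions for illustration):

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.master("local[1]").appName("align-demo").getOrCreate()

        # Hypothetical frames with mismatched schemas: dfB lacks the "extra" column.
        dfA = spark.createDataFrame([(1, "old", "x"), (2, "keep", "y")], ["id", "value", "extra"])
        dfB = spark.createDataFrame([(1, "new")], ["id", "value"])

        # Select dfA's columns from dfB, substituting a NULL cast to dfA's column
        # type wherever dfB has no such column; keeps dfA's column order.
        dfB_aligned = dfB.select(
            *[
                F.col(c) if c in dfB.columns else F.lit(None).cast(dfA.schema[c].dataType).alias(c)
                for c in dfA.columns
            ]
        )
        dfB_aligned.show()

    Casting the NULL literal to dfA's column type (rather than leaving it as NullType) avoids type-resolution surprises when the aligned frame is later unioned with dfA.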
