Question
There are two DataFrames, df1 and df2, with the same schema. ID is the primary key.
I need to merge df1 and df2. This could be done with a union,
except for one special requirement: if there are duplicate rows with the same ID in df1 and df2, I need to keep the one from df1.
df1:
ID col1 col2
1 AA 2019
2 B 2018
df2:
ID col1 col2
1 A 2019
3 C 2017
I need the following output:
ID col1 col2
1 AA 2019
2 B 2018
3 C 2017
How can I do this? Thanks. I think it is possible to register two temp tables, do a full join, and use coalesce,
but I would rather not go that way, because there are in fact about 40 columns, not 3 as in the example above.
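For reference, that full-join-plus-coalesce approach would look roughly like the sketch below (hypothetical, using the example's 3-column schema; with ~40 columns the coalesce list grows accordingly):
from pyspark.sql.functions import coalesce
# Sketch only: full outer join on ID, then one coalesce per column,
# preferring df1's value; this must be repeated for every column.
joined = df1.alias("a").join(df2.alias("b"), on="ID", how="full")
result = joined.select(
    "ID",
    coalesce("a.col1", "b.col1").alias("col1"),
    coalesce("a.col2", "b.col2").alias("col2"),
    # ...and so on for each of the remaining ~40 columns
)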
Answer 1:
Given that the two DataFrames have the same schema, you could simply union df1 with the left_anti join of df2 and df1:
df1.union(df2.join(df1, Seq("ID"), "left_anti")).show
// +---+----+----+
// | ID|col1|col2|
// +---+----+----+
// |  1|  AA|2019|
// |  2|   B|2018|
// |  3|   C|2017|
// +---+----+----+
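For anyone working in PySpark, the same idea translates directly (a sketch, assuming df1 and df2 as defined in the question):
# Keep only the df2 rows whose ID is absent from df1, then union with df1
df1.union(df2.join(df1, on="ID", how="left_anti")).show()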
Answer 2:
One way to do this is to union the DataFrames with an identifier column that records which DataFrame each row came from, and then use that column to prioritize rows from df1 with a function like row_number.
A PySpark SQL solution is shown here.
from pyspark.sql.functions import lit, row_number, when
from pyspark.sql import Window

# Tag each row with the DataFrame it originated from
df1_with_identifier = df1.withColumn('identifier', lit('df1'))
df2_with_identifier = df2.withColumn('identifier', lit('df2'))
merged_df = df1_with_identifier.union(df2_with_identifier)

# Define the Window with the desired ordering: rows from df1 sort first
w = Window.partitionBy(merged_df.ID).orderBy(when(merged_df.identifier == 'df1', 1).otherwise(2))
result = merged_df.withColumn('rownum', row_number().over(w))

# Keep the highest-priority row per ID and drop the helper columns
result.filter(result.rownum == 1).drop('identifier', 'rownum').show()
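Within each ID partition, the when expression sorts the df1 row first, so row_number assigns it 1; filtering on rownum == 1 therefore keeps the df1 version wherever both DataFrames contain the same ID.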
A solution with a left join on df1 could be a lot simpler, except that you would have to write multiple coalesces.
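If you did go the join-plus-coalesce route, the coalesces could be generated rather than written by hand (a sketch, assuming a full outer join on ID so that rows unique to either DataFrame survive):
from pyspark.sql.functions import coalesce, col
# Hypothetical sketch: generate one coalesce per non-key column,
# preferring df1's value, instead of typing ~40 of them manually.
joined = df1.alias("a").join(df2.alias("b"), on="ID", how="full")
cols = [coalesce(col("a." + c), col("b." + c)).alias(c)
        for c in df1.columns if c != "ID"]
joined.select("ID", *cols).show()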
Source: https://stackoverflow.com/questions/57838825/spark-merge-two-dataframes-if-id-duplicated-in-two-dataframes-the-row-in-df1