Removing duplicate columns after a DF join in Spark

小鲜肉 2020-12-24 05:46

When you join two DataFrames that share a column name:

df = df1.join(df2, df1['id'] == df2['id'])

The join works fine, but you can't call the

7 Answers
  •  一向 (original poster)
    2020-12-24 06:17

    Assume 'a' is a DataFrame with a column 'id' and 'b' is another DataFrame that also has a column 'id'.

    I use the following two methods to remove duplicates:

    Method 1: Use a string join expression instead of a boolean expression. This automatically removes the duplicate column for you.

    a.join(b, 'id')
    

    Method 2: Rename the column before the join and drop it afterwards.

    # withColumnRenamed returns a new DataFrame, so reassign b
    b = b.withColumnRenamed('id', 'b_id')
    joinexpr = a['id'] == b['b_id']
    a.join(b, joinexpr).drop('b_id')
    
