Removing duplicate columns after a DF join in Spark

Asked by 小鲜肉 on 2020-12-24 05:46

When you join two DFs with similar column names:

df = df1.join(df2, df1['id'] == df2['id'])

The join works fine, but you can't call the `id` column afterwards because it is ambiguous: both input data frames contribute a column with that name.

7 Answers
  • Answered 2020-12-24 06:20

    If the join columns in both data frames have the same name and you only need an equi-join, you can pass the join columns as a list; in that case the result keeps only one copy of each join column:

    df1.show()
    +---+----+
    | id|val1|
    +---+----+
    |  1|   2|
    |  2|   3|
    |  4|   4|
    |  5|   5|
    +---+----+
    
    df2.show()
    +---+----+
    | id|val2|
    +---+----+
    |  1|   2|
    |  1|   3|
    |  2|   4|
    |  3|   5|
    +---+----+
    
    df1.join(df2, ['id']).show()
    +---+----+----+
    | id|val1|val2|
    +---+----+----+
    |  1|   2|   2|
    |  1|   2|   3|
    |  2|   3|   4|
    +---+----+----+
    

    Otherwise you need to give the joined data frames aliases and refer to the duplicated columns through those aliases later:

    df1.alias("a").join(
        df2.alias("b"), df1['id'] == df2['id']
    ).select("a.id", "a.val1", "b.val2").show()
    +---+----+----+
    | id|val1|val2|
    +---+----+----+
    |  1|   2|   2|
    |  1|   2|   3|
    |  2|   3|   4|
    +---+----+----+
    
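    A third option (a sketch, assuming PySpark is installed and `drop()` is given a `Column` rather than a string) is to drop the right-hand copy of the join column immediately after the join. Because `df2["id"]` unambiguously names one specific column, `drop()` removes only that copy:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("dedup").getOrCreate()

    # Same sample data as above.
    df1 = spark.createDataFrame([(1, 2), (2, 3), (4, 4), (5, 5)], ["id", "val1"])
    df2 = spark.createDataFrame([(1, 2), (1, 3), (2, 4), (3, 5)], ["id", "val2"])

    # Drop df2's id column; df1's id column survives and is no longer ambiguous.
    joined = df1.join(df2, df1["id"] == df2["id"]).drop(df2["id"])
    joined.show()
    ```

    Note that passing the string `"id"` to `drop()` would not help here, since the string form cannot tell the two copies apart; passing the `Column` object is what disambiguates.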