Removing duplicate columns after a DF join in Spark

小鲜肉 2020-12-24 05:46

When you join two DFs with similar column names:

df = df1.join(df2, df1['id'] == df2['id'])

Join works fine but you can't call the

7 Answers
  •  野趣味 (OP)
    2020-12-24 06:18

    The code below works with Spark 1.6.0 and above.

    salespeople_df.show()
    +---+------+-----+
    |Num|  Name|Store|
    +---+------+-----+
    |  1| Henry|  100|
    |  2| Karen|  100|
    |  3|  Paul|  101|
    |  4| Jimmy|  102|
    |  5|Janice|  103|
    +---+------+-----+
    
    storeaddress_df.show()
    +-----+--------------------+
    |Store|             Address|
    +-----+--------------------+
    |  100|    64 E Illinos Ave|
    |  101|         74 Grand Pl|
    |  102|          2298 Hwy 7|
    |  103|No address available|
    +-----+--------------------+
    

    Assuming, as in this example, that the shared column has the same name in both DataFrames:

    joined=salespeople_df.join(storeaddress_df, ['Store'])
    joined.orderBy('Num', ascending=True).show()
    
    +-----+---+------+--------------------+
    |Store|Num|  Name|             Address|
    +-----+---+------+--------------------+
    |  100|  1| Henry|    64 E Illinos Ave|
    |  100|  2| Karen|    64 E Illinos Ave|
    |  101|  3|  Paul|         74 Grand Pl|
    |  102|  4| Jimmy|          2298 Hwy 7|
    |  103|  5|Janice|No address available|
    +-----+---+------+--------------------+
    

    Passing the column name in a list to .join, rather than an equality expression, prevents the shared column from being duplicated.

    If you also want to remove another column, such as Num in this example, you can use .drop('colname'):

    joined=joined.drop('Num')
    joined.show()
    
    +-----+------+--------------------+
    |Store|  Name|             Address|
    +-----+------+--------------------+
    |  103|Janice|No address available|
    |  100| Henry|    64 E Illinos Ave|
    |  100| Karen|    64 E Illinos Ave|
    |  101|  Paul|         74 Grand Pl|
    |  102| Jimmy|          2298 Hwy 7|
    +-----+------+--------------------+
    
