Removing duplicate columns after a DF join in Spark

小鲜肉 2020-12-24 05:46

When you join two DFs with similar column names:

df = df1.join(df2, df1['id'] == df2['id'])

The join works fine, but you can't reference the `id` column afterwards because it is ambiguous.
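To see why both `id` columns survive the join, here is a plain-Python sketch of an inner equi-join over lists of dicts (no Spark required; `equi_join` and the sample rows are illustrative, not a Spark API):

```python
def equi_join(left, right, key):
    """Naive inner equi-join: keeps every column from both sides,
    so the join key appears once per side in the result, just as
    Spark keeps both df1['id'] and df2['id']."""
    out = []
    for l in left:
        for r in right:
            if l[key] == r[key]:
                # Prefix each side's columns so both copies stay visible
                row = {"l." + k: v for k, v in l.items()}
                row.update({"r." + k: v for k, v in r.items()})
                out.append(row)
    return out

rows = equi_join([{"id": 1, "name": "a"}], [{"id": 1, "age": 9}], "id")
# Both 'l.id' and 'r.id' are present, so a bare "id" would be ambiguous
```

In Spark itself the usual way to avoid the duplicate is to join on the column name (`df1.join(df2, 'id')`), which keeps a single `id` column.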

7 Answers
  •  星月不相逢
    2020-12-24 06:06

    In my case I had a dataframe with multiple duplicate columns after joins, and I was trying to save that dataframe in CSV format, but I was getting an error because of the duplicate columns. I followed the steps below to drop them. The code is in Scala.

    1) Rename all the duplicate columns and make a new dataframe
    2) Make a separate list of all the renamed columns
    3) Make a new dataframe with all columns (including those renamed in step 1)
    4) Drop all the renamed columns

    import scala.collection.mutable
    import org.apache.spark.sql.DataFrame

    private def removeDuplicateColumns(dataFrame: DataFrame): DataFrame = {
      val allColumns: mutable.MutableList[String] = mutable.MutableList()
      val dupColumns: mutable.MutableList[String] = mutable.MutableList()
      dataFrame.columns.foreach { (i: String) =>
        if (allColumns.contains(i)) {
          // Second and later occurrences get a temporary "dup_" prefix
          allColumns += "dup_" + i
          dupColumns += "dup_" + i
        } else {
          allColumns += i
        }
      }
      // Apply the renamed column list, then drop the prefixed duplicates
      val df = dataFrame.toDF(allColumns.toSeq: _*)
      df.drop(dupColumns.toSeq: _*)
    }
    

    To call the above function, use the code below and pass the dataframe that contains the duplicate columns:

    val uniColDF = removeDuplicateColumns(df)
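The same rename-then-drop idea can be sketched in plain Python, with no Spark required, to show exactly what happens to the column list (`dedupe_columns` is an illustrative name, not part of any Spark API):

```python
def dedupe_columns(columns):
    """Rename repeated column names with a 'dup_' prefix and collect them.

    Returns (renamed, dups): the full renamed column list, and the
    prefixed names that should then be dropped. Note that, like the
    Scala version above, a name appearing three or more times would
    produce repeated 'dup_' entries.
    """
    renamed, dups, seen = [], [], set()
    for c in columns:
        if c in seen:
            renamed.append("dup_" + c)
            dups.append("dup_" + c)
        else:
            seen.add(c)
            renamed.append(c)
    return renamed, dups

renamed, dups = dedupe_columns(["id", "name", "id", "age"])
# renamed == ["id", "name", "dup_id", "age"], dups == ["dup_id"]
```

In PySpark you would then apply `renamed` with `df.toDF(*renamed)` and drop the columns in `dups`, mirroring the Scala function.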
    
