Removing duplicate columns after a DF join in Spark

小鲜肉 2020-12-24 05:46

When you join two DFs with similar column names:

df = df1.join(df2, df1['id'] == df2['id'])

The join works fine, but you can't call the `id` column afterwards, because it is ambiguous: both input DataFrames contribute a column with that name.

7 Answers
  •  我在风中等你
    2020-12-24 05:54

    After I've joined multiple tables together, I run the result through a simple function that walks the columns from left to right and drops any column whose name has already been seen. Alternatively, you could rename these columns instead.

    Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.

    Names = sparkSession.sql("SELECT * FROM Names")
    Dates = sparkSession.sql("SELECT * FROM Dates")
    NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
    NamesAndDates = dropDupeDfCols(NamesAndDates)
    NamesAndDates.write.saveAsTable("...", format="parquet", mode="overwrite", path="...")
    

    Where dropDupeDfCols is defined as:

    def dropDupeDfCols(df):
        """Drop duplicate columns, keeping the leftmost occurrence of each name."""
        newcols = []
        dupcols = []
    
        # Walk the columns left to right, recording the positions of repeated names.
        for i in range(len(df.columns)):
            if df.columns[i] not in newcols:
                newcols.append(df.columns[i])
            else:
                dupcols.append(i)
    
        # Temporarily rename every column to its positional index so the
        # duplicates can be dropped unambiguously, then restore the kept names.
        df = df.toDF(*[str(i) for i in range(len(df.columns))])
        for dupcol in dupcols:
            df = df.drop(str(dupcol))
    
        return df.toDF(*newcols)
    

    The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Date'].
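
    The left-to-right walk that decides which occurrences survive can be checked without a Spark session. This is a hypothetical helper (`first_occurrence_cols` is not part of the answer above) that mirrors the column scan inside `dropDupeDfCols` on a plain list of names:

    ```python
    def first_occurrence_cols(columns):
        """Return (kept_names, duplicate_positions), keeping the leftmost
        occurrence of each column name, mirroring dropDupeDfCols."""
        kept, dup_positions = [], []
        for i, name in enumerate(columns):
            if name not in kept:
                kept.append(name)
            else:
                dup_positions.append(i)
        return kept, dup_positions

    # Columns of Names joined with Dates, walked left to right:
    joined = ['Id', 'Name', 'DateId', 'Description', 'Id', 'Date', 'Description']
    kept, dups = first_occurrence_cols(joined)
    # kept -> ['Id', 'Name', 'DateId', 'Description', 'Date']
    # dups -> [4, 6]  (the second Id and the second Description)
    ```

    The same positions are what `dropDupeDfCols` drops after renaming columns to their indices, which is why the right-hand duplicates disappear while the left-hand originals remain.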
