Concatenate two PySpark dataframes

独厮守ぢ 2020-12-02 16:28

I'm trying to concatenate two PySpark dataframes, each of which has some columns the other lacks:

from pyspark.sql.functions import randn, rand

df_1 = sqlContext.range(0, 10)
df_2 = sqlContext.range(11, 20)
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal_2"))
10 Answers
  •  伪装坚强ぢ
    2020-12-02 17:17

    To make it more generic, so that all columns from both df1 and df2 are kept:

    import pyspark.sql.functions as F
    
    # Keep all columns in either df1 or df2
    def outer_union(df1, df2):
    
        # Add missing columns to df1
        left_df = df1
        for column in set(df2.columns) - set(df1.columns):
            left_df = left_df.withColumn(column, F.lit(None))
    
        # Add missing columns to df2
        right_df = df2
        for column in set(df1.columns) - set(df2.columns):
            right_df = right_df.withColumn(column, F.lit(None))
    
        # Make sure columns are ordered the same
        return left_df.union(right_df.select(left_df.columns))
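
    On Spark 3.1 or later there is a built-in way to do the same thing: `unionByName` with `allowMissingColumns=True`, which matches columns by name and fills the missing ones with nulls. A minimal self-contained sketch (the local SparkSession setup and the tiny example dataframes are illustrative, not from the question):

    ```python
    from pyspark.sql import SparkSession

    # Local session just for the demo
    spark = SparkSession.builder.master("local[1]").appName("outer-union-demo").getOrCreate()

    # Two dataframes that share "id" but each have one column the other lacks
    df_a = spark.createDataFrame([(1, "x")], ["id", "a"])
    df_b = spark.createDataFrame([(2, "y")], ["id", "b"])

    # Spark >= 3.1: columns missing on either side are filled with null
    result = df_a.unionByName(df_b, allowMissingColumns=True)
    result.show()
    ```

    On older Spark versions the hand-rolled function above is the way to go; the built-in avoids having to keep the column order aligned yourself.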
    
