Concatenate two PySpark dataframes

独厮守ぢ  2020-12-02 16:28

I'm trying to concatenate two PySpark dataframes with some columns that only exist in one of them:

from pyspark.sql.functions import randn, rand

df_1 = sqlContext.range(0, 10)
df_2 = sqlContext.range(11, 20)
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal_2"))
        
10 Answers
  •  陌清茗  2020-12-02 17:12

    This should do it for you ...

    from pyspark.sql.functions import lit, coalesce, col

    # Two example frames: df_1 carries an extra "source" column, df_2 only "id"
    df_1 = sqlContext.range(0, 6)
    df_2 = sqlContext.range(3, 10)
    df_1 = df_1.select("id", lit("old").alias("source"))
    df_2 = df_2.select("id")

    df_1.show()
    df_2.show()

    # Full outer join on id, then merge the two id columns with coalesce and
    # keep df_1's remaining columns (null where the row exists only in df_2)
    df_3 = (df_1.alias("df_1")
        .join(df_2.alias("df_2"), df_1.id == df_2.id, "outer")
        .select([coalesce(df_1.id, df_2.id).alias("id")] +
                [col("df_1." + c) for c in df_1.columns if c != "id"])
        .sort("id"))
    df_3.show()
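    With the ranges above, df_3 should come out with ids 0 through 9, one row each: the overlapping ids 3-5 are collapsed into single rows by the coalesce, and source is null for ids 6-9, which exist only in df_2.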
    
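    A side note on the newer API: since Spark 3.1 a similar "union with missing columns" result can be had without a join, via DataFrame.unionByName with allowMissingColumns=True. A minimal sketch, assuming Spark >= 3.1 and a SparkSession named spark (rather than the sqlContext used above):

    from pyspark.sql.functions import lit

    # Same example frames as above (minimal sketch, assumes Spark >= 3.1)
    df_1 = spark.range(0, 6).withColumn("source", lit("old"))
    df_2 = spark.range(3, 10)

    # unionByName aligns columns by name; allowMissingColumns=True fills
    # df_2's missing "source" column with nulls instead of raising an error
    df_3 = df_1.unionByName(df_2, allowMissingColumns=True)
    df_3.sort("id").show()

    Unlike the outer join, this stacks all rows, so the overlapping ids 3-5 appear twice; prefer the join version if you need one row per id.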
