How to delete columns in pyspark dataframe

滥情空心 2021-01-30 01:55
>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
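
If the goal in the title is simply to delete a column from a DataFrame, DataFrame.drop does that directly. A minimal sketch, assuming the column to remove is user_id from a above (drop takes column names and returns a new DataFrame; it does not modify a in place):

>>> a = a.drop("user_id")
>>> a
DataFrame[id: bigint, julian_date: string]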


        
8 Answers
  •  甜味超标
    2021-01-30 02:40

    Consider 2 dataFrames:

    >>> aDF.show()
    +---+----+
    | id|datA|
    +---+----+
    |  1|  a1|
    |  2|  a2|
    |  3|  a3|
    +---+----+
    

    and

    >>> bDF.show()
    +---+----+
    | id|datB|
    +---+----+
    |  2|  b2|
    |  3|  b3|
    |  4|  b4|
    +---+----+
    

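    For reproducibility, here is one way these two example DataFrames might be built; a sketch assuming an existing SparkSession named spark (as in the pyspark shell):

    >>> aDF = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], ["id", "datA"])
    >>> bDF = spark.createDataFrame([(2, "b2"), (3, "b3"), (4, "b4")], ["id", "datB"])
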
    To accomplish what you are looking for, there are 2 ways:

    1. Use a different join condition. Instead of writing the condition as aDF.id == bDF.id:

    aDF.join(bDF, aDF.id == bDF.id, "outer")
    

    pass the shared column name instead:

    aDF.join(bDF, "id", "outer").show()
    +---+----+----+
    | id|datA|datB|
    +---+----+----+
    |  1|  a1|null|
    |  3|  a3|  b3|
    |  2|  a2|  b2|
    |  4|null|  b4|
    +---+----+----+
    

    Because the join is expressed on the column name, the result keeps only one id column, so there is no extra dropping step.

    2. Use aliasing and drop the duplicate column. Note that with this approach you lose the id values that exist only in bDF; they come back as null:

    >>> from pyspark.sql.functions import col
    >>> aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()
    
    +----+----+----+
    |  id|datA|datB|
    +----+----+----+
    |   1|  a1|null|
    |   3|  a3|  b3|
    |   2|  a2|  b2|
    |null|null|  b4|
    +----+----+----+
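
    A closely related variant, for what it's worth: drop also accepts a Column, so the duplicate can be removed by referencing it through the original DataFrame, without aliases (a sketch using the same DataFrames; as above, the row that exists only in bDF ends up with a null id):

    >>> aDF.join(bDF, aDF.id == bDF.id, "outer").drop(bDF.id).show()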
    
