remove NULL columns in Spark SQL

误落风尘 2020-12-12 02:19

How to remove columns containing only null values from a table? Suppose I have a table -

SnapshotDate    CreationDate    Country Region  CloseDate   Probabi         


        
2 Answers
  • 2020-12-12 02:51

    You can register a custom UDF and call it in Spark SQL:

    sqlContext.udf.register("ISNOTNULL", (str: String) => Option(str).getOrElse(""))
    

    Then, in Spark SQL, you can run:

    SELECT ISNOTNULL(Probability)   AS Probability,
           ISNOTNULL(BookingAmount) AS BookingAmount,
           ISNOTNULL(RevenueAmount) AS RevenueAmount
    FROM df
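
    Note that this UDF does not remove any columns; it only replaces nulls with empty strings. The core of the registered function can be sketched in plain Scala (the helper name `blankIfNull` is illustrative, not part of any Spark API):

    ```scala
    // Plain-Scala sketch of what the registered UDF does: map null to "".
    // blankIfNull is a hypothetical name used for illustration only.
    def blankIfNull(s: String): String = Option(s).getOrElse("")

    println(blankIfNull(null))   // prints an empty string
    println(blankIfNull("0.7"))  // prints "0.7"
    ```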
    
  • 2020-12-12 02:55

    I solved this with a global groupBy. This works for numeric and non-numeric columns:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, when}

    case class Entry(id: Long, name: String, value: java.lang.Float)

    val results = Seq(
      Entry(10, null, null),
      Entry(10, null, null),
      Entry(20, null, null)
    )

    val df: DataFrame = spark.createDataFrame(results)

    // Mark each cell: 0 if null, 1 otherwise, then take the max per column.
    // A column whose max is 0 contains only nulls.
    val row = df
      .select(df.columns.map(c => when(col(c).isNull, 0).otherwise(1).as(c)): _*)
      .groupBy().max(df.columns: _*)
      .first

    // Keep only the aggregated columns whose max is 1 (at least one non-null value)
    val colKeep = row.getValuesMap[Int](row.schema.fieldNames)
      .collect { case (name, 1) => name }
      .toArray

    // Strip the "max(" prefix and ")" suffix to recover the original column names
    df.select(row.schema.fieldNames.intersect(colKeep)
      .map(c => col(c.drop(4).dropRight(1))): _*).show(false)
    
    +---+
    |id |
    +---+
    |10 |
    |10 |
    |20 |
    +---+
    

    Edit: I removed the shuffling of columns. The new approach keeps the given order of the columns.
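
    The column-pruning idea above, independent of Spark, can be sketched over an in-memory table (rows as maps; all names here are illustrative, not from any library):

    ```scala
    // Sketch of "keep only columns with at least one non-null value"
    // on an in-memory table, mirroring the groupBy().max trick.
    val columns = Seq("id", "name", "value")
    val rows: Seq[Map[String, Any]] = Seq(
      Map("id" -> 10L, "name" -> null, "value" -> null),
      Map("id" -> 10L, "name" -> null, "value" -> null),
      Map("id" -> 20L, "name" -> null, "value" -> null)
    )

    // A column survives if any row holds a non-null value for it
    val colKeep = columns.filter(c => rows.exists(r => r(c) != null))

    println(colKeep)  // List(id)
    ```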
