remove NULL columns in Spark SQL

误落风尘 2020-12-12 02:19

How do I remove columns containing only null values from a table? Suppose I have a table:

SnapshotDate    CreationDate    Country Region  CloseDate   Probabi         


        
2 Answers
  •  抹茶落季
    2020-12-12 02:55

    I solved this with a global groupBy(). It works for both numeric and non-numeric columns:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, when}
    
    case class Entry(id: Long, name: String, value: java.lang.Float)
    
    val results = Seq(
      Entry(10, null, null),
      Entry(10, null, null),
      Entry(20, null, null)
    )
    
    val df: DataFrame = spark.createDataFrame(results)
    
    // replace every value with a 0/1 null indicator, then take the global max:
    // a column whose max is 0 contains nothing but nulls
    val row = df
      .select(df.columns.map(c => when(col(c).isNull, 0).otherwise(1).as(c)): _*)
      .groupBy().max(df.columns: _*)
      .first
    
    // keep only the columns whose max indicator is 1
    val colKeep = row.getValuesMap[Int](row.schema.fieldNames)
      .map { c => if (c._2 == 1) Some(c._1) else None }
      .flatten.toArray
    
    // the aggregated fields are named "max(<col>)", so strip the "max(" prefix
    // and the trailing ")" to recover the original column names
    df.select(row.schema.fieldNames.intersect(colKeep)
      .map(c => col(c.drop(4).dropRight(1))): _*).show(false)
    
    +---+
    |id |
    +---+
    |10 |
    |10 |
    |20 |
    +---+
    

    Edit: I removed the column shuffling; the new approach preserves the given order of the columns.
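
    As an alternative, here is a minimal sketch (not part of the original answer) that relies on the fact that count() ignores nulls, so a column whose non-null count is 0 contains only nulls. It avoids the "max(...)" name stripping and also keeps the original column order. It assumes the same spark session and df as above; the names counts and keep are mine:

    import org.apache.spark.sql.functions.{col, count}
    
    // count the non-null values of every column in a single global aggregation
    val counts = df
      .select(df.columns.map(c => count(col(c)).as(c)): _*)
      .first
    
    // keep the columns with at least one non-null value, in the original order
    val keep = df.columns.filter(c => counts.getAs[Long](c) > 0)
    df.select(keep.map(c => col(c)): _*).show(false)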
