Spark: Get only columns that have one or more null values

Submitted by 混江龙づ霸主 on 2021-02-19 04:25:47

Question


From a dataframe I want to get the names of the columns that contain at least one null value.

Consider the dataframe below:

// Option is used for the nullable columns: a bare null in the Int column
// would leave Spark unable to derive a schema for the tuples
val dataset = sparkSession.createDataFrame(Seq(
  (7, None, Some(18), 1.0),
  (8, Some("CA"), None, 0.0),
  (9, Some("NZ"), Some(15), 0.0)
)).toDF("id", "country", "hour", "clicked")

I want to get the column names 'country' and 'hour'.

id  country hour    clicked
7   null    18      1
8   "CA"    null    0
9   "NZ"    15      0

Answer 1:


This is one solution, but it's a bit awkward; I hope there is an easier way:

import org.apache.spark.sql.functions.{col, sum, when}

val cols = dataset.columns

val columnsToSelect = dataset
  // count null values per column (by summing a 1 for every row where the column is null)
  .select(cols.map(c => (sum(when(col(c).isNull, 1)) > 0).alias(c)): _*)
  .head() // collect the result of the aggregation as a single row
  .getValuesMap[Boolean](cols) // map of column name -> "has nulls" flag
  .filter { case (c, hasNulls) => hasNulls } // keep only the columns flagged true
  .keys.toSeq // and get the names of those columns


dataset
  .select(columnsToSelect.head,columnsToSelect.tail:_*)
  .show()
+-------+----+
|country|hour|
+-------+----+
|   null|  18|
|     CA|null|
|     NZ|  15|
+-------+----+
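
For reference, here is a slightly more direct sketch of the same idea (my own variation, not part of the original answer): count the null rows per column with count over a when(col(c).isNull, ...) expression, which yields 0 rather than null for columns without nulls, and then filter the column names on the driver.

import org.apache.spark.sql.functions.{col, count, when}

// count() skips the nulls produced by when() without otherwise(),
// so this counts exactly the rows where each column is null
val nullCounts = dataset
  .select(dataset.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*)
  .head()

// keep only the columns whose null count is greater than zero
val columnsWithNulls = dataset.columns.filter(c => nullCounts.getAs[Long](c) > 0)
// columnsWithNulls: Array[String] = Array(country, hour)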



Answer 2:


A slight variation: compare the non-null count per column to the total number of rows:

import org.apache.spark.sql.functions.{count,col}
// Get number of rows
val nr_rows = dataset.count
// Indices of columns whose non-null count is less than the number of rows
val col_inds = dataset.select(dataset.columns.map(c => count(col(c)).alias(c)): _*)
                 .collect()(0)
                 .toSeq.zipWithIndex
                 .filter(_._1 != nr_rows).map(_._2)
// Subset the column names using the indices
col_inds.map(i => dataset.columns(i))
// Seq[String] = ArrayBuffer(country, hour)
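
As a small follow-up (a variant sketch of the same approach, reusing the imports and nr_rows from above), the collected counts can be zipped with the column names directly, which avoids the index bookkeeping:

// Non-null count per column, in column order
val counts = dataset
  .select(dataset.columns.map(c => count(col(c)).alias(c)): _*)
  .collect()(0)
  .toSeq

// Keep the names of columns whose non-null count is below the total row count
val colsWithNulls = dataset.columns.zip(counts).collect {
  case (name, cnt: Long) if cnt != nr_rows => name
}
// colsWithNulls: Array[String] = Array(country, hour)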


Source: https://stackoverflow.com/questions/48261746/spark-get-only-columns-that-have-one-or-more-null-values
