Question
I want to do some preprocessing on my data and drop the rows that are sparse (for some threshold value).
For example, I have a DataFrame with 10 features, and if a row has 8 null values, I want to drop it.
I found some related topics but I cannot find any useful information for my purpose.
stackoverflow.com/questions/3473778/count-number-of-nulls-in-a-row
Examples like the one in the link above won't work for me, because I want to do this preprocessing automatically; I cannot write out the column names and handle each one accordingly.
So is there any way to do this delete operation without using the column names, in Apache Spark with Scala?
Answer 1:
Test data:
case class Document(a: String, b: String, c: String)
import sqlContext.implicits._
val df = sc.parallelize(Seq(Document(null, null, null), Document("a", null, null), Document("a", "b", null), Document("a", "b", "c"), Document(null, null, "c"))).toDF()
With UDF
Remixing the answer by David and my RDD version below, you can do it using a UDF that takes a row:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}
def nullFilter = udf((x: Row) => Range(0, x.length).count(x.isNullAt(_)) < 2)
df.filter(nullFilter(struct(df.columns.map(df(_)): _*))).show()
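If you don't want to hard-code the "< 2" threshold, here is a small sketch reusing the imports above (the rowNullFilter name and the maxNulls parameter are my own, not from the original answer):
// keep rows with at most `maxNulls` nulls
def rowNullFilter(maxNulls: Int) = udf((r: Row) => Range(0, r.length).count(r.isNullAt(_)) <= maxNulls)
df.filter(rowNullFilter(1)(struct(df.columns.map(df(_)): _*))).show()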
With RDD
You could turn it into an RDD, loop over the columns in the Row, and count how many are null.
sqlContext.createDataFrame(df.rdd.filter(x => Range(0, x.length).count(x.isNullAt(_)) < 2), df.schema).show()
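If, as in the question, the threshold is really "8 nulls out of 10 features", here is a sketch of the same RDD approach with the cutoff derived from the column count (maxNullFraction is an assumed value, not from the original answer):
val maxNullFraction = 0.8   // assumed: drop rows where at least 80% of the features are null
val numCols = df.columns.length
val cleaned = sqlContext.createDataFrame(
  df.rdd.filter(r => Range(0, r.length).count(r.isNullAt(_)) < numCols * maxNullFraction),
  df.schema)
cleaned.show()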
Answer 2:
I'm surprised that no answers pointed out that Spark SQL comes with a few standard functions that meet the requirement:
For example, I have a DataFrame with 10 features, and if a row has 8 null values, I want to drop it.
You could use one of the variants of the DataFrameNaFunctions.drop method, with minNonNulls set appropriately, say 2.
drop(minNonNulls: Int, cols: Seq[String]): DataFrame Returns a new DataFrame that drops rows containing less than minNonNulls non-null and non-NaN values in the specified columns.
And to meet the variability in the column names as in the requirement:
I cannot write the column names and do something accordingly.
You can simply use Dataset.columns:
columns: Array[String] Returns all column names as an array.
Let's say you've got the following dataset with 5 features (columns) and a few rows that are almost all nulls.
val ns: String = null
val features = Seq(("0","1","2",ns,ns), (ns, ns, ns, ns, ns), (ns, "1", ns, "2", ns)).toDF
scala> features.show
+----+----+----+----+----+
| _1| _2| _3| _4| _5|
+----+----+----+----+----+
| 0| 1| 2|null|null|
|null|null|null|null|null|
|null| 1|null| 2|null|
+----+----+----+----+----+
// drop rows with more than (5 columns - 2) = 3 nulls
scala> features.na.drop(2, features.columns).show
+----+---+----+----+----+
| _1| _2| _3| _4| _5|
+----+---+----+----+----+
| 0| 1| 2|null|null|
|null| 1|null| 2|null|
+----+---+----+----+----+
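To tie this back to the original requirement (drop rows with more than some number of nulls, without hard-coding column names), here is a small sketch where the threshold value is my own assumption:
val maxNulls = 3   // assumed: allow at most 3 nulls per row
features.na.drop(features.columns.length - maxNulls, features.columns).show()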
Answer 3:
It's cleaner with a UDF:
import org.apache.spark.sql.functions.{col, udf}
def countNulls = udf((v: Any) => if (v == null) 1 else 0)
// register it as well so the SQL string below can call it by name
sqlContext.udf.register("countNulls", (v: Any) => if (v == null) 1 else 0)
df.registerTempTable("foo")
sqlContext.sql(
  "select " + df.columns.mkString(", ") + ", " +
  df.columns.map(c => "countNulls(" + c + ")").mkString(" + ") +
  " as nullCount from foo"
).filter($"nullCount" < 8).show  // keep rows with fewer than 8 nulls
If making query strings makes you nervous, then you can try this:
var countCol: org.apache.spark.sql.Column = null
df.columns.foreach(c => {
  if (countCol == null) countCol = countNulls(col(c))
  else countCol = countCol + countNulls(col(c))
})
df.select(Seq(countCol as "nullCount") ++ df.columns.map(c => col(c)): _*)
  .filter($"nullCount" < 8)  // keep rows with fewer than 8 nulls
  .show
Answer 4:
Here is an alternative in Spark 2.0:
val df = Seq((null,"A"),(null,"B"),("1","C"))
.toDF("foo","bar")
.withColumn("foo", 'foo.cast("Int"))
df.show()
+----+---+
| foo|bar|
+----+---+
|null| A|
|null| B|
| 1| C|
+----+---+
df.where('foo.isNull).groupBy('foo).count().show()
+----+-----+
| foo|count|
+----+-----+
|null| 2|
+----+-----+
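That query only counts the rows where foo is null; to actually drop the sparse rows in Spark 2.0 you can still use na.drop (the threshold here is chosen for this 2-column example, not taken from the original answer):
// keep only rows with at least 2 non-null values, i.e. no nulls in this 2-column example
df.na.drop(2, df.columns).show()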
Source: https://stackoverflow.com/questions/36062908/how-to-drop-rows-with-too-many-null-values