Question
I want to do some preprocessing on my data and drop the rows that are sparse (for some threshold value).
For example, I have a DataFrame with 10 features, and if a row has 8 null values, I want to drop it.
I found some related topics but I cannot find any useful information for my purpose.
stackoverflow.com/questions/3473778/count-number-of-nulls-in-a-row
Examples like the one in the link above won't work for me, because I want to do this preprocessing automatically; I cannot write out the column names and handle each one accordingly.
So is there any way to do this delete operation without using the column names, in Apache Spark with Scala?
Answer 1:
Test data:
case class Document(a: String, b: String, c: String)
import sqlContext.implicits._
val df = sc.parallelize(Seq(Document(null, null, null), Document("a", null, null), Document("a", "b", null), Document("a", "b", "c"), Document(null, null, "c"))).toDF()
With UDF
Remixing the answer by David and my RDD version below, you can do it using a UDF that takes a row:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}
def nullFilter = udf((x: Row) => Range(0, x.length).count(x.isNullAt(_)) < 2)
df.filter(nullFilter(struct(df.columns.map(df(_)): _*))).show()
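If you don't want to hard-code the "< 2" threshold, here is a small sketch reusing the imports above (the rowNullFilter name and the maxNulls parameter are my own, not from the original answer):
// keep rows with at most `maxNulls` nulls
def rowNullFilter(maxNulls: Int) = udf((r: Row) => Range(0, r.length).count(r.isNullAt(_)) <= maxNulls)
df.filter(rowNullFilter(1)(struct(df.columns.map(df(_)): _*))).show()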
With RDD
You could turn it into an RDD, loop over the columns in the Row, and count how many are null.
sqlContext.createDataFrame(df.rdd.filter(x => Range(0, x.length).count(x.isNullAt(_)) < 2), df.schema).show()
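If, as in the question, the threshold is really "8 nulls out of 10 features", here is a sketch of the same RDD approach with the cutoff derived from the column count (maxNullFraction is an assumed value, not from the original answer):
val maxNullFraction = 0.8   // assumed: drop rows where at least 80% of the features are null
val numCols = df.columns.length
val cleaned = sqlContext.createDataFrame(
  df.rdd.filter(r => Range(0, r.length).count(r.isNullAt(_)) < numCols * maxNullFraction),
  df.schema)
cleaned.show()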
Answer 2:
I'm surprised that no answers pointed out that Spark SQL comes with a few standard functions that meet the requirement:
For example, I have a DataFrame with 10 features, and if a row has 8 null values, I want to drop it.
You could use one of the variants of the DataFrameNaFunctions.drop method, with minNonNulls set appropriately, say 2.
drop(minNonNulls: Int, cols: Seq[String]): DataFrame Returns a new DataFrame that drops rows containing less than minNonNulls non-null and non-NaN values in the specified columns.
And to meet the variability in the column names as in the requirement:
I cannot write the column names and do something accordingly.
You can simply use Dataset.columns:
columns: Array[String] Returns all column names as an array.
Let's say you've got the following dataset with 5 features (columns) and a few rows that are almost all nulls.
val ns: String = null
val features = Seq(("0","1","2",ns,ns), (ns, ns, ns, ns, ns), (ns, "1", ns, "2", ns)).toDF
scala> features.show
+----+----+----+----+----+
| _1| _2| _3| _4| _5|
+----+----+----+----+----+
| 0| 1| 2|null|null|
|null|null|null|null|null|
|null| 1|null| 2|null|
+----+----+----+----+----+
// drop rows with more than (5 columns - 2) = 3 nulls
scala> features.na.drop(2, features.columns).show
+----+---+----+----+----+
| _1| _2| _3| _4| _5|
+----+---+----+----+----+
| 0| 1| 2|null|null|
|null| 1|null| 2|null|
+----+---+----+----+----+
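To tie this back to the original requirement (drop rows with more than some number of nulls, without hard-coding column names), here is a small sketch where the threshold value is my own assumption:
val maxNulls = 3   // assumed: allow at most 3 nulls per row
features.na.drop(features.columns.length - maxNulls, features.columns).show()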
Answer 3:
It's cleaner with a UDF:
import org.apache.spark.sql.functions.{col, udf}
def countNulls = udf((v: Any) => if (v == null) 1 else 0)
// register it as well so the SQL string below can call it by name
sqlContext.udf.register("countNulls", (v: Any) => if (v == null) 1 else 0)
df.registerTempTable("foo")
sqlContext.sql(
  "select " + df.columns.mkString(", ") + ", " +
  df.columns.map(c => "countNulls(" + c + ")").mkString(" + ") +
  " as nullCount from foo"
).filter($"nullCount" < 8).show  // keep rows with fewer than 8 nulls
If making query strings makes you nervous, then you can try this:
var countCol: org.apache.spark.sql.Column = null
df.columns.foreach(c => {
  if (countCol == null) countCol = countNulls(col(c))
  else countCol = countCol + countNulls(col(c))
})
df.select(Seq(countCol as "nullCount") ++ df.columns.map(c => col(c)): _*)
  .filter($"nullCount" < 8)  // keep rows with fewer than 8 nulls
  .show
Answer 4:
Here is an alternative in Spark 2.0:
val df = Seq((null,"A"),(null,"B"),("1","C"))
.toDF("foo","bar")
.withColumn("foo", 'foo.cast("Int"))
df.show()
+----+---+
| foo|bar|
+----+---+
|null| A|
|null| B|
| 1| C|
+----+---+
df.where('foo.isNull).groupBy('foo).count().show()
+----+-----+
| foo|count|
+----+-----+
|null| 2|
+----+-----+
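That query only counts the rows where foo is null; to actually drop the sparse rows in Spark 2.0 you can still use na.drop (the threshold here is chosen for this 2-column example, not taken from the original answer):
// keep only rows with at least 2 non-null values, i.e. no nulls in this 2-column example
df.na.drop(2, df.columns).show()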
Source: https://stackoverflow.com/questions/36062908/how-to-drop-rows-with-too-many-null-values