Question
I am trying to get the rows with null values from a PySpark DataFrame. In pandas, I can achieve this using isnull() on the DataFrame:
df = df[df.isnull().any(axis=1)]
But in PySpark, when I run the command below it throws an AttributeError:
df.filter(df.isNull())
AttributeError: 'DataFrame' object has no attribute 'isNull'.
How can I get the rows with null values without checking each column individually?
Answer 1:
You can filter the rows with where, reduce, and a generator expression over the columns. For example, given the following DataFrame:
df = sc.parallelize([
    (0.4, 0.3),
    (None, 0.11),
    (9.7, None),
    (None, None)
]).toDF(["A", "B"])
df.show()
+----+----+
| A| B|
+----+----+
| 0.4| 0.3|
|null|0.11|
| 9.7|null|
|null|null|
+----+----+
Rows containing at least one null value can then be filtered with:
import pyspark.sql.functions as f
from functools import reduce

# Build one predicate by OR-ing an isNull() test for every column
df.where(reduce(lambda x, y: x | y, (f.col(x).isNull() for x in df.columns))).show()
Which gives:
+----+----+
| A| B|
+----+----+
|null|0.11|
| 9.7|null|
|null|null|
+----+----+
In the condition you specify how the per-column checks combine: | (or) matches rows where any column is null, & (and) matches rows where all columns are null, and so on.
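For instance, a minimal sketch of the all-columns variant, reusing the df and imports from above:

import pyspark.sql.functions as f
from functools import reduce

# Swap | for &: a row matches only when every column is null
df.where(reduce(lambda x, y: x & y, (f.col(c).isNull() for c in df.columns))).show()

With the example data this keeps only the |null|null| row.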
Answer 2:
This is how you can do it in Scala:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .toDF() on a Seq

case class Test(id: Int, weight: Option[Int], age: Int, gender: Option[String])
val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()

// OR together an isNull check per column; display() is Databricks-specific, use .show() elsewhere
display(df1.filter(df1.columns.map(c => col(c).isNull).reduce((a, b) => a || b)))
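Back in PySpark, a compact alternative (a sketch, not part of the original answers; it assumes Spark 2.4+ for exceptAll) is to subtract the null-free rows from the full DataFrame:

# Rows with at least one null are exactly the rows dropna() removes,
# so df minus dropna()'s result leaves only the null-containing rows.
df = spark.createDataFrame(
    [(0.4, 0.3), (None, 0.11), (9.7, None), (None, None)], ["A", "B"]
)
df.exceptAll(df.dropna()).show()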
Source: https://stackoverflow.com/questions/53486981/how-to-return-rows-with-null-values-in-pyspark-dataframe