How to efficiently find the count of null and NaN values for each column in a PySpark dataframe?

試著忘記壹切 submitted on 2019-11-26 22:20:48

Question


import numpy as np

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', 'timestamp1', 'id2'))

Expected output

A dataframe with the count of NaN/null values for each column.

Note: The previous questions I found on Stack Overflow only check for null, not NaN. That's why I created a new question.

I know I can use the isnull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark dataframe?


Answer 1:


You can use the method shown here, replacing isNull with isnan:

from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  3|
+-------+----------+---+

or

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  5|
+-------+----------+---+
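One caveat worth noting: isnan is only meaningful for float/double columns, and applying it to non-numeric types such as timestamps can raise an analysis error. A minimal sketch of a variation (my addition, not part of the original answer) that restricts the NaN test to NaN-capable columns and uses only the null check elsewhere:

from pyspark.sql.functions import isnan, when, count, col

# Columns whose types can actually hold NaN (float/double); all others get only the null check.
nan_capable = {name for name, dtype in df.dtypes if dtype in ("float", "double")}

df.select([
    count(when((isnan(c) | col(c).isNull()) if c in nan_capable else col(c).isNull(), c)).alias(c)
    for c in df.columns
]).show()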



Answer 2:


You can create a UDF to check for both null and NaN and return a boolean value to filter on.

The code below is Scala; hopefully you can convert it to Python.

val isNaN = udf((value: Float) => value.equals(Float.NaN) || value == null)

val result = data.filter(isNaN(data("column2"))).count()
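
A rough Python equivalent of this UDF (a sketch only; the name is_nan_or_null and the column 'id2' from the question's dataframe are my own choices, not part of the answer):

import math

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# True when the value is either null (None on the Python side) or NaN.
is_nan_or_null = udf(lambda value: value is None or math.isnan(value), BooleanType())

result = df.filter(is_nan_or_null(df["id2"])).count()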

Hope this helps!




Answer 3:


Here is my one-liner, where 'c' is the name of the column:

from pyspark.sql import functions as F

df.select('c').withColumn('isNull_c', F.col('c').isNull()).where('isNull_c = True').count()
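
Note that this counts only nulls in a single column. If you also want NaN included in the same one-liner style, a sketch using the question's 'id2' column (my adaptation, not part of the original answer) could be:

from pyspark.sql import functions as F

df.select('id2').withColumn(
    'isNaN_or_Null_id2',
    F.isnan(F.col('id2')) | F.col('id2').isNull()
).where('isNaN_or_Null_id2 = True').count()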


Source: https://stackoverflow.com/questions/44627386/how-to-find-count-of-null-and-nan-values-for-each-column-in-a-pyspark-dataframe
