How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

广开言路 2020-11-28 21:35
import numpy as np

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', 'timestamp1', 'id2'))


        
5 Answers
  •  猫巷女王i
    2020-11-28 21:51

    You can use the method shown here and replace isNull with isnan:

    from pyspark.sql.functions import isnan, when, count, col
    
    df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
    +-------+----------+---+
    |session|timestamp1|id2|
    +-------+----------+---+
    |      0|         0|  3|
    +-------+----------+---+
    

    or

    df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
    +-------+----------+---+
    |session|timestamp1|id2|
    +-------+----------+---+
    |      0|         0|  5|
    +-------+----------+---+
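
    If the dataframe also contains non-numeric columns (strings, timestamps, dates), isnan will fail or be meaningless on them, because NaN only exists for floating-point types. Below is a minimal sketch, assuming the same df as above, that applies isnan only to float/double columns and falls back to a plain null check for everything else:

    from pyspark.sql.functions import isnan, when, count, col

    # NaN is only defined for floating-point columns, so check isnan there
    # and use a plain isNull check for every other column type.
    float_types = ('float', 'double')
    df.select([
        count(when(isnan(c) | col(c).isNull(), c)).alias(c)
        if t in float_types
        else count(when(col(c).isNull(), c)).alias(c)
        for c, t in df.dtypes
    ]).show()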
    
