How to find the count of null and NaN values for each column in a PySpark dataframe efficiently?

广开言路 2020-11-28 21:35
import numpy as np

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', 'timestamp1', 'id2'))
5 Answers
  • 2020-11-28 21:51

    You can use the count(when(...)) approach commonly used for counting nulls and replace isNull with isnan:

    from pyspark.sql.functions import isnan, when, count, col
    
    df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
    +-------+----------+---+
    |session|timestamp1|id2|
    +-------+----------+---+
    |      0|         0|  3|
    +-------+----------+---+
    

    or

    df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
    +-------+----------+---+
    |session|timestamp1|id2|
    +-------+----------+---+
    |      0|         0|  5|
    +-------+----------+---+
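
    If you need the counts programmatically rather than via show(), the same single-pass aggregation can be collected into a Python dict (a sketch, assuming the dataframe from the question):

    from pyspark.sql.functions import isnan, when, count, col

    # One row comes back; asDict() maps each column name to its null/NaN count.
    missing = df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).first().asDict()
    # {'session': 0, 'timestamp1': 0, 'id2': 5}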
    
  • 2020-11-28 21:55

    To count only the null values in each column of a PySpark dataframe:

    Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}
    Dict_Null
    
    # The output is a dict where the key is the column name and the value is the number of nulls in that column
    
    {'#': 0,
     'Name': 0,
     'Type 1': 0,
     'Type 2': 386,
     'Total': 0,
     'HP': 0,
     'Attack': 0,
     'Defense': 0,
     'Sp_Atk': 0,
     'Sp_Def': 0,
     'Speed': 0,
     'Generation': 0,
     'Legendary': 0}
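
    Note that this dict comprehension launches one Spark job per column. If efficiency matters, the same dict can be built in a single pass with the count/when pattern (a sketch, not part of the original answer):

    from pyspark.sql.functions import count, when, col

    # One job over the whole dataframe; the result has the same shape as Dict_Null above.
    Dict_Null = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).first().asDict()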
    
  • 2020-11-28 21:57

    An alternative to the approaches already provided is to simply filter on the column, like so:

    import pyspark.sql.functions as F

    df = df.where(F.col('columnNameHere').isNull())
    

    This has the added benefit that you don't have to add another column to do the filtering, and it's quick on larger data sets.
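
    If you also want NaN values included and need the actual count, one possible variant (a sketch; 'columnNameHere' is just a placeholder, and the column is assumed to be numeric since isnan is not defined for strings or timestamps):

    # Keep only rows where the column is null or NaN, then count them.
    missing_count = df.where(F.col('columnNameHere').isNull() | F.isnan('columnNameHere')).count()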

  • 2020-11-28 22:02

    Here is my one-liner, where 'c' is the name of the column:

    import pyspark.sql.functions as F

    df.select('c').withColumn('isNull_c', F.col('c').isNull()).where('isNull_c = True').count()
    
  • 2020-11-28 22:04

    To make sure it does not fail for string, date and timestamp columns:

    import pyspark.sql.functions as F

    def count_missings(spark_df, sort=True):
        """
        Counts the number of nulls and NaNs in each column.
        String, date and timestamp columns are skipped, since isnan is not defined for them.
        """
        df = spark_df.select([
            F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
            for (c, c_type) in spark_df.dtypes
            if c_type not in ('timestamp', 'string', 'date')
        ]).toPandas()

        if len(df) == 0:
            print("There are no missing values!")
            return None

        if sort:
            return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)

        return df
    

    If you want to see the columns sorted by the number of NaNs and nulls in descending order:

    count_missings(spark_df)
    
    # | Col_A | 10 |
    # | Col_C | 2  |
    # | Col_B | 1  | 
    

    If you don't want ordering and prefer to see them as a single row:

    count_missings(spark_df, False)
    # | Col_A | Col_B | Col_C |
    # |  10   |   1   |   2   |
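
    A possible extension (a sketch, not part of the original answer): string, date and timestamp columns are skipped entirely above, yet they can still contain nulls. One way to cover every column is to apply isnan only where the type supports NaN:

    import pyspark.sql.functions as F

    def count_missings_all(spark_df):
        # Count nulls in every column, and additionally NaNs for float/double columns.
        exprs = [
            F.count(
                F.when(F.isnan(c) | F.col(c).isNull(), c) if t in ('float', 'double')
                else F.when(F.col(c).isNull(), c)
            ).alias(c)
            for c, t in spark_df.dtypes
        ]
        return spark_df.select(exprs).first().asDict()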
    