How to select rows with one or more nulls from a pandas DataFrame without listing columns explicitly?

滥情空心 2020-11-28 00:37

I have a dataframe with ~300K rows and ~40 columns. I want to find out if any rows contain null values - and put these 'null'-rows into a separate dataframe so that I coul

5 answers
  •  自闭症患者
    2020-11-28 01:06

    .any() and .all() are great for the extreme cases, but not when you're looking for a specific number of null values. Here's an extremely simple way to do what I believe you're asking. It's pretty verbose, but functional.

    import pandas as pd
    import numpy as np
    
    # Some test data frame
    df = pd.DataFrame({'num_legs':          [2, 4,      np.nan, 0, np.nan],
                       'num_wings':         [2, 0,      np.nan, 0, 9],
                       'num_specimen_seen': [10, np.nan, 1,     8, np.nan]})
    
    # Helper: counts the NaNs in each row
    def row_nan_sums(df):
        sums = []
        for row in df.values:
            count = 0  # named "count" to avoid shadowing the built-in sum()
            for el in row:
                # NaN is never equal to itself. This is "hacky", and it does not catch None.
                if el != el:
                    count += 1
            sums.append(count)
        return sums
    
    # Returns a list of row positions for rows with k or more NaNs
    def query_k_plus_sums(df, k):
        sums = row_nan_sums(df)
        indices = []
        for i, nan_count in enumerate(sums):
            if nan_count >= k:
                indices.append(i)
        return indices
    
    # test
    print(df)
    print(query_k_plus_sums(df, 2))
    

    Output

       num_legs  num_wings  num_specimen_seen
    0       2.0        2.0               10.0
    1       4.0        0.0                NaN
    2       NaN        NaN                1.0
    3       0.0        0.0                8.0
    4       NaN        9.0                NaN
    [2, 4]
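
For comparison, the same count can be obtained without explicit loops. The following is a minimal sketch, assuming the same test frame as above; the names `nan_counts`, `rows_k_plus` and `rows_with_nulls` are illustrative and not part of the original answer. It relies on `DataFrame.isna()`, which returns a boolean frame that can be summed or reduced across columns with `axis=1`.

```python
import numpy as np
import pandas as pd

# Same test data frame as in the answer above
df = pd.DataFrame({'num_legs':          [2, 4,      np.nan, 0, np.nan],
                   'num_wings':         [2, 0,      np.nan, 0, 9],
                   'num_specimen_seen': [10, np.nan, 1,     8, np.nan]})

# Number of NaN values in each row (True counts as 1 when summed)
nan_counts = df.isna().sum(axis=1)

# Rows with at least k NaNs (k is an illustrative parameter)
k = 2
rows_k_plus = df[nan_counts >= k]

# Rows with one or more NaNs in any column, with no column names listed
rows_with_nulls = df[df.isna().any(axis=1)]

print(rows_k_plus.index.tolist())      # [2, 4]
print(rows_with_nulls.index.tolist())  # [1, 2, 4]
```

The second selection is the "one or more nulls" case from the question; the resulting frame is a separate object that can be inspected on its own.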
    

    Then, if you're like me and want to clear those rows out, you just write this:

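    # drop the rows from the data frame
    # (the positions returned above match the labels only for a default 0..n-1 index)
    df.drop(query_k_plus_sums(df, 2), inplace=True)
    # Shuffle the rows, then renumber the index from 0.
    # The shuffle is optional: reset_index(drop=True) alone resets the index.
    df = df.sample(frac=1).reset_index(drop=True)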
    # print data frame
    print(df)
    

    Output (the row order varies between runs because of the shuffle):

       num_legs  num_wings  num_specimen_seen
    0       4.0        0.0                NaN
    1       0.0        0.0                8.0
    2       2.0        2.0               10.0
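
If the original row order should be kept, the drop can also be written with a boolean mask. This is a sketch under the same assumptions as the test frame above, not the answer's own method; `kept` and `alt` are illustrative names. It also shows `dropna(thresh=...)`, where `thresh` is the minimum number of non-NaN values a row needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Same test data frame as in the answer above
df = pd.DataFrame({'num_legs':          [2, 4,      np.nan, 0, np.nan],
                   'num_wings':         [2, 0,      np.nan, 0, 9],
                   'num_specimen_seen': [10, np.nan, 1,     8, np.nan]})

k = 2  # rows with k or more NaNs are removed

# Keep rows with fewer than k NaNs, then renumber the index from 0
kept = df[df.isna().sum(axis=1) < k].reset_index(drop=True)

# Equivalent with dropna: a row with fewer than k NaNs has
# at least (number of columns - k + 1) non-NaN values
alt = df.dropna(thresh=df.shape[1] - k + 1).reset_index(drop=True)

print(kept)
print(kept.equals(alt))  # True
```

Neither variant shuffles the rows, so the remaining rows stay in their original order.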
    
