How to delete a column in pandas dataframe based on a condition?

猫巷女王i 2020-12-09 19:39

I have a pandas DataFrame with many NaN values in it.

How can I drop columns such that number_of_na_values > 2000?

I tried to

2 Answers
  •  庸人自扰
    2020-12-09 20:40

    Here's another alternative: keep only the columns whose NaN count is less than or equal to a specified maximum:

    max_number_of_nas = 3000
    df = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nas)]
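
    To make the one-liner above easier to follow, here is a minimal sketch on a small toy frame. The column contents and the threshold of 2 are made up for illustration; only the `isnull().sum(axis=0)` mask and the `.loc` selection come from the answer itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "A" has 1 NaN, "B" has 3 NaNs, "C" has none.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 4.0],
    "C": [1.0, 2.0, 3.0, 4.0],
})

max_number_of_nas = 2  # illustrative threshold, not the 3000 used above

# Count NaNs per column; the result is a Series indexed by column name.
na_counts = df.isnull().sum(axis=0)

# Boolean mask: True for every column that should be kept.
keep = na_counts <= max_number_of_nas

# Select all rows and only the columns where the mask is True.
filtered = df.loc[:, keep]
print(list(filtered.columns))  # ['A', 'C']
```

    Splitting the expression into `na_counts` and `keep` is only for readability; the result is the same as the single-line version.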
    

    In the cases I tested, this was slightly faster than the drop-columns method suggested by Jianxun Li:

    import numpy as np
    import pandas as pd

    # Large frame: 10,000 rows, with every negative value replaced by NaN
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(10000, 5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    max_number_of_nans = 5010

    # Boolean-mask selection with .loc
    %timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
    >> 1000 loops, best of 3: 1.76 ms per loop

    # drop() with a per-column apply
    %timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
    >> 100 loops, best of 3: 2.04 ms per loop


    # Small frame: 10 rows, same procedure
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    max_number_of_nans = 5

    %timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
    >> 1000 loops, best of 3: 662 µs per loop

    %timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
    >> 1000 loops, best of 3: 1.08 ms per loop
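
    The timing magics above only work inside IPython or Jupyter. As a rough sketch of how the comparison might be repeated in a plain Python script, the standard-library `timeit` module can be used instead. The helper names `keep_with_mask` and `drop_with_apply` are hypothetical, introduced here only for illustration. The script also checks that both approaches keep the same columns. Actual timings depend on the machine and the pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Same setup as the larger benchmark in this answer.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000, 5), columns=list("ABCDE"))
df[df < 0] = np.nan
max_number_of_nans = 5010


def keep_with_mask(frame, limit):
    """Keep columns whose NaN count is at most `limit` (boolean mask + .loc)."""
    return frame.loc[:, frame.isnull().sum(axis=0) <= limit]


def drop_with_apply(frame, limit):
    """Drop columns whose NaN count exceeds `limit` (apply + drop)."""
    too_many = frame.apply(lambda col: col.isnull().sum() > limit)
    # Convert the boolean Series to a NumPy array before indexing the columns.
    return frame.drop(frame.columns[too_many.to_numpy()], axis=1)


a = keep_with_mask(df, max_number_of_nans)
b = drop_with_apply(df, max_number_of_nans)

# Both approaches should select exactly the same columns.
assert list(a.columns) == list(b.columns)

# Time 100 calls of each function and report the average per call.
for fn in (keep_with_mask, drop_with_apply):
    seconds = timeit.timeit(lambda: fn(df, max_number_of_nans), number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e3:.3f} ms per call")
```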
    
