How to delete a column in pandas dataframe based on a condition?

猫巷女王i 2020-12-09 19:39

I have a pandas DataFrame with many NaN values in it.

How can I drop columns such that number_of_na_values > 2000?

I tried to

2 Answers
  • 2020-12-09 20:14

    Same logic, but with everything in one line.

    import pandas as pd
    import numpy as np
    
    # artificial data
    # ====================================
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(10,5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    
            A       B       C       D       E
    0  1.7641  0.4002  0.9787  2.2409  1.8676
    1     NaN  0.9501     NaN     NaN  0.4106
    2  0.1440  1.4543  0.7610  0.1217  0.4439
    3  0.3337  1.4941     NaN  0.3131     NaN
    4     NaN  0.6536  0.8644     NaN  2.2698
    5     NaN  0.0458     NaN  1.5328  1.4694
    6  0.1549  0.3782     NaN     NaN     NaN
    7  0.1563  1.2303  1.2024     NaN     NaN
    8     NaN     NaN     NaN  1.9508     NaN
    9     NaN     NaN  0.7775     NaN     NaN
    
    # processing: drop columns with no. of NaN > 3
    # ====================================
    df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > 3)], axis=1)
    
    
    Out[183]:
            B
    0  0.4002
    1  0.9501
    2  1.4543
    3  1.4941
    4  0.6536
    5  0.0458
    6  0.3782
    7  1.2303
    8     NaN
    9     NaN
    
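    A built-in alternative worth knowing (a sketch, using the same artificial frame as above): `DataFrame.dropna` with the `thresh` argument keeps only the columns that have at least `thresh` non-NaN values, which expresses the same rule without `apply`:

    ```python
    import pandas as pd
    import numpy as np

    # same artificial data as above
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
    df[df < 0] = np.nan

    # drop columns with more than 3 NaNs, i.e. keep columns
    # with at least len(df) - 3 non-NaN values
    max_nans = 3
    result = df.dropna(axis=1, thresh=len(df) - max_nans)
    ```

    On the frame above this also leaves only column `B`.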
  • 2020-12-09 20:40

    Here's an alternative that keeps the columns with at most the specified number of NaNs:

    max_number_of_nans = 3000
    df = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
    

    In my tests this seems to be slightly faster than the column-drop method suggested by Jianxun Li:

    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(10000,5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    max_number_of_nans = 5010
    
    %timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
    >> 1000 loops, best of 3: 1.76 ms per loop
    
    %timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
    >> 100 loops, best of 3: 2.04 ms per loop
    
    
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    max_number_of_nans = 5
    
    %timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
    >> 1000 loops, best of 3: 662 µs per loop
    
    %timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
    >> 1000 loops, best of 3: 1.08 ms per loop
    
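    As a quick sanity check (a sketch on the small 10×5 frame from the other answer), the two approaches should select exactly the same columns:

    ```python
    import pandas as pd
    import numpy as np

    # rebuild the small example frame
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    max_number_of_nans = 3

    # boolean-mask selection (this answer)
    kept = df.loc[:, df.isnull().sum(axis=0) <= max_number_of_nans]

    # drop-based selection (the other answer)
    dropped = df.drop(
        df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)],
        axis=1)
    ```

    Both return a frame containing only column `B`, so the speed difference is the only distinction.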