问题
I have a DF with 200 columns. Most of them are with NaN's. I would like to select all columns with no NaN's or at least with the minimum NaN's. I've tried to drop all with a threshold or with notnull() but without success. Any ideas.
df.dropna(thresh=2, inplace=True)
df_notnull = df[df.notnull()]
DF for example:
col1 col2 col3
23 45 NaN
54 39 NaN
NaN 45 76
87 32 NaN
The output should look like:
df.dropna(axis=1, thresh=2)
col1 col2
23 45
54 39
NaN 45
87 32
回答1:
You can create with non-NaN columns using
df = df[df.columns[~df.isnull().all()]]
Or
null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)
If you wish to remove columns based on a certain percentage of NaNs, say columns with more than 90% data as null
cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
回答2:
I assume that you wan't to get all the columns without any NaN. If that's the case, you can first get the name of the columns without any NaN using ~col.isnull.any(), then use that your columns.
I can think in the following code:
import pandas as pd
df = pd.DataFrame({
'col1': [23, 54, pd.np.nan, 87],
'col2': [45, 39, 45, 32],
'col3': [pd.np.nan, pd.np.nan, 76, pd.np.nan,]
})
# This function will check if there is a null value in the column
def has_nan(col, threshold=0):
return col.isnull().sum() > threshold
# Then you apply the "complement" of function to get the column with
# no NaN.
df.loc[:, ~df.apply(has_nan)]
# ... or pass the threshold as parameter, if needed
df.loc[:, ~df.apply(has_nan, args=(2,))]
回答3:
you should try df_notnull = df.dropna(how='all')
This will get you only non null rows.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
回答4:
null_series = df.isnull().sum() # The number of missing values from each column in your dataframe
full_col_series = null_series[null_series == 0] # Will keep only the columns with no missing values
df = df[full_col_series.index]
回答5:
df[df.columns[~df.isnull().any()]] will give you a DataFrame with only the columns that have no null values, and should be the solution.
df[df.columns[~df.isnull().all()]] only removes the columns that have nothing but null values and leaves columns with even one non-null value.
df.isnull() will return a dataframe of booleans with the same shape as df. These bools will be True if the particular value is null and False if it isn't.
df.isnull().any() will return True for all columns with even one null. This is where I'm diverging from the accepted answer, as df.isnull().all() will not flag columns with even one value!
回答6:
Here is a simple function which you can use directly by passing dataframe and threshold
df
'''
pets location owner id
0 cat San_Diego Champ 123.0
1 dog NaN Ron NaN
2 cat NaN Brick NaN
3 monkey NaN Champ NaN
4 monkey NaN Veronica NaN
5 dog NaN John NaN
'''
def rmissingvaluecol(dff,threshold):
l = []
l = list(dff.drop(dff.loc[:,list((100*(dff.isnull().sum()/len(dff.index))>=threshold))].columns, 1).columns.values)
print("# Columns having more than %s percent missing values:"%threshold,(dff.shape[1] - len(l)))
print("Columns:\n",list(set(list((dff.columns.values))) - set(l)))
return l
rmissingvaluecol(df,1) #Here threshold is 1% which means we are going to drop columns having more than 1% of missing values
#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
['id', 'location']
'''
Now create new dataframe excluding these columns
l = rmissingvaluecol(df,1)
df1 = df[l]
PS: You can change threshold as per your requirement
Bonus step
You can find the percentage of missing values for each column (optional)
def missing(dff):
print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))
missing(df)
#output
'''
id 83.33
location 83.33
owner 0.00
pets 0.00
dtype: float64
'''
来源:https://stackoverflow.com/questions/47414848/pandas-select-all-columns-without-nan