Pandas select all columns without NaN

Posted by *爱你&永不变心* on 2019-12-14 03:59:46

Question


I have a DataFrame with 200 columns, most of which contain NaNs. I would like to select all columns with no NaNs, or at least those with the fewest NaNs. I've tried dropping with a threshold and filtering with notnull(), but without success. Any ideas?

df.dropna(thresh=2, inplace=True)
df_notnull = df[df.notnull()]

DF for example:

col1  col2 col3
23     45  NaN
54     39  NaN
NaN    45  76
87     32  NaN

The output should look like:

 df.dropna(axis=1, thresh=2)

    col1  col2
    23     45  
    54     39  
    NaN    45  
    87     32  

Answer 1:


You can create a DataFrame that keeps only the columns that are not entirely NaN using

df = df[df.columns[~df.isnull().all()]]

Or

null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)

If you wish to remove columns based on a certain percentage of NaNs, say columns in which more than 90% of the values are null:

cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
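The percentage rule above can also be written without mutating the original frame. The following is a minimal sketch, assuming the sample data from the question; the helper name drop_sparse_columns and its max_null_frac parameter are hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],
    "col2": [45, 39, 45, 32],
    "col3": [np.nan, np.nan, 76, np.nan],
})

def drop_sparse_columns(frame, max_null_frac):
    """Return a copy without the columns whose NaN fraction exceeds max_null_frac.

    Hypothetical helper for illustration only.
    """
    # The mean of a boolean column is the fraction of True values,
    # so this is the fraction of NaNs per column.
    null_frac = frame.isnull().mean()
    return frame.loc[:, null_frac <= max_null_frac].copy()

result = drop_sparse_columns(df, 0.5)
print(result.columns.tolist())  # ['col1', 'col2']
```

Here col3 is 75% NaN and is dropped, while col1 (25% NaN) and col2 (0% NaN) are kept. Returning a copy instead of using inplace=True leaves the original df available for comparison.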



Answer 2:


I assume that you want to get all the columns without any NaN. If that's the case, you can first get the names of the columns without any NaN using ~col.isnull().any(), then use that to select your columns.

I can think of the following code:

import numpy as np
import pandas as pd

# pd.np was removed from recent pandas releases, so import numpy directly
df = pd.DataFrame({
    'col1': [23, 54, np.nan, 87],
    'col2': [45, 39, 45, 32],
    'col3': [np.nan, np.nan, 76, np.nan]
})

# This function checks whether a column has more than `threshold` null values
def has_nan(col, threshold=0):
    return col.isnull().sum() > threshold

# Then apply the complement of the function's result to get the columns
# with no NaN.

df.loc[:, ~df.apply(has_nan)]

# ... or pass the threshold as parameter, if needed
df.loc[:, ~df.apply(has_nan, args=(2,))]



Answer 3:


You should try df_notnull = df.dropna(how='all'). This drops only the rows in which every value is null; pass axis=1 to apply the same rule to columns instead.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
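For the column-wise case the question asks about, dropna can be called with axis=1. This is a small sketch on the question's sample data, showing how the how and thresh arguments change which columns survive; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],
    "col2": [45, 39, 45, 32],
    "col3": [np.nan, np.nan, 76, np.nan],
})

# Drop every column that contains at least one NaN
no_nan_cols = df.dropna(axis=1, how="any")

# Drop only the columns that are entirely NaN (none in this sample)
not_all_nan_cols = df.dropna(axis=1, how="all")

# Keep the columns that have at least 3 non-NaN values
at_least_three = df.dropna(axis=1, thresh=3)

print(no_nan_cols.columns.tolist())       # ['col2']
print(not_all_nan_cols.columns.tolist())  # ['col1', 'col2', 'col3']
print(at_least_three.columns.tolist())    # ['col1', 'col2']
```

Note that thresh counts non-NaN values, not NaNs, and that how and thresh are passed in separate calls here because recent pandas versions do not accept both together.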




Answer 4:


null_series = df.isnull().sum() # The number of missing values from each column in your dataframe
full_col_series = null_series[null_series == 0] # Will keep only the columns with no missing values

df = df[full_col_series.index]



Answer 5:


df[df.columns[~df.isnull().any()]] will give you a DataFrame with only the columns that have no null values, and should be the solution.

df[df.columns[~df.isnull().all()]] only removes the columns that have nothing but null values and leaves columns with even one non-null value.

df.isnull() will return a dataframe of booleans with the same shape as df. These bools will be True if the particular value is null and False if it isn't.

df.isnull().any() will return True for all columns with even one null. This is where I'm diverging from the accepted answer, as df.isnull().all() will not flag columns with even one value!
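The difference between any() and all() is easy to see on a small frame. The sketch below uses the question's sample data plus a hypothetical col4 that is entirely NaN, added here only so that all() has something to flag.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],
    "col2": [45, 39, 45, 32],
    "col3": [np.nan, np.nan, 76, np.nan],
    "col4": [np.nan] * 4,  # hypothetical all-NaN column, not in the question
})

mask = df.isnull()         # same shape as df, True where a value is NaN
has_any_null = mask.any()  # per column: True if at least one NaN
is_all_null = mask.all()   # per column: True only if every value is NaN

# Columns with no NaN at all
complete = df[df.columns[~has_any_null]]
# Columns with at least one real value
not_empty = df[df.columns[~is_all_null]]

print(complete.columns.tolist())   # ['col2']
print(not_empty.columns.tolist())  # ['col1', 'col2', 'col3']
```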




Answer 6:


Here is a simple function that you can use directly by passing it a DataFrame and a threshold:

df
'''
     pets   location     owner     id
0     cat  San_Diego     Champ  123.0
1     dog        NaN       Ron    NaN
2     cat        NaN     Brick    NaN
3  monkey        NaN     Champ    NaN
4  monkey        NaN  Veronica    NaN
5     dog        NaN      John    NaN
'''

def rmissingvaluecol(dff, threshold):
    # Percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # Columns at or above the threshold are dropped; the rest are kept
    to_drop = pct_missing >= threshold
    l = list(dff.columns[~to_drop])
    print("# Columns having more than %s percent missing values:" % threshold, (dff.shape[1] - len(l)))
    print("Columns:\n", list(set(dff.columns) - set(l)))
    return l


rmissingvaluecol(df,1) #Here threshold is 1% which means we are going to drop columns having more than 1% of missing values

#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
 ['id', 'location']
'''

Now create a new DataFrame excluding these columns:

l = rmissingvaluecol(df,1)
df1 = df[l]

PS: You can change threshold as per your requirement

Bonus step

You can find the percentage of missing values for each column (optional)

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))

missing(df)

#output
'''
id          83.33
location    83.33
owner        0.00
pets         0.00
dtype: float64
'''
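The question also asks for the columns "with the minimum NaN's". Building on the per-column counts, one possible approach is to rank the columns by NaN count and keep the k with the fewest. This is only a sketch: the helper name columns_with_fewest_nans and the parameter k are hypothetical, and the frame is rebuilt from the sample shown in this answer.

```python
import numpy as np
import pandas as pd

# Sample frame from this answer
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "monkey", "dog"],
    "location": ["San_Diego"] + [np.nan] * 5,
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "John"],
    "id": [123.0] + [np.nan] * 5,
})

def columns_with_fewest_nans(frame, k):
    """Return the names of the k columns with the fewest NaNs.

    Hypothetical helper for illustration; the result keeps the
    frame's original column order.
    """
    null_counts = frame.isnull().sum()
    keep = set(null_counts.nsmallest(k).index)
    return [col for col in frame.columns if col in keep]

best = columns_with_fewest_nans(df, 2)
df_best = df[best]
print(best)  # ['pets', 'owner']
```

Unlike a fixed percentage threshold, this always returns k columns, even if some of them still contain NaNs, so it is worth checking the counts before relying on the result.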


Source: https://stackoverflow.com/questions/47414848/pandas-select-all-columns-without-nan
