Imputation of missing values for categories in pandas

时光说笑 2020-12-04 18:03

How can I fill NaNs in a categorical column of a pandas DataFrame with the column's most frequent level?

In R randomForest package there is na.roughfix option : A

4 Answers
  • 2020-12-04 18:24
    def fillna(col):
        # Fill NaNs with the column's most frequent value;
        # value_counts() sorts by frequency and skips NaN by default.
        return col.fillna(col.value_counts().index[0])

    df = df.apply(fillna)
    
  • 2020-12-04 18:25

    You can use df = df.fillna(df['Label'].value_counts().index[0]) to fill NaNs with the most frequent value of a single column (here 'Label'). Note that this fills the NaNs in every column of df with that one value.

    If you want to fill every column with its own most frequent value you can use

    df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))

    UPDATE 2018-25-10

    Starting from version 0.13.1, pandas includes a mode method for Series and DataFrames. You can use it to fill the missing values of each column with that column's own most frequent value, like this:

    df = df.fillna(df.mode().iloc[0])
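
As a minimal sketch of the mode-based fill above: the frame and its column names ("color", "size") are made up for illustration. df.mode() returns one row per modal value, so when a column has ties there are several rows and .iloc[0] simply takes the first one.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red", np.nan]),  # category dtype
    "size": ["S", np.nan, "M", "M"],                          # plain strings
})

# Row 0 of df.mode() holds the first mode of every column.
modes = df.mode().iloc[0]

# fillna with a Series aligns on column names, so each column gets its own mode.
filled = df.fillna(modes)
print(filled)
```

Because the fill value is already one of the existing categories, the "color" column should keep its category dtype.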
    
  • 2020-12-04 18:35

    In more recent versions of scikit-learn you can use SimpleImputer to impute both numeric and categorical columns:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    arr = [[1., 'x'], [np.nan, 'y'], [7., 'z'], [7., 'y'], [4., np.nan]]
    df1 = pd.DataFrame({'x1': [x[0] for x in arr],
                        'x2': [x[1] for x in arr]},
                       index=list('abcde'))
    # 'most_frequent' works for both numeric and string columns
    imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    print(pd.DataFrame(imp.fit_transform(df1),
                       columns=df1.columns,
                       index=df1.index))
    #   x1 x2
    # a  1  x
    # b  7  y
    # c  7  z
    # d  7  y
    # e  4  y
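
One thing to watch with this approach: fit_transform returns a single NumPy array, so for mixed input the rebuilt DataFrame no longer carries the original per-column dtypes. The sketch below reuses the same data and casts the columns back explicitly; the target dtypes passed to astype are my assumption about what you would want, not something SimpleImputer does for you.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df1 = pd.DataFrame({"x1": [1., np.nan, 7., 7., 4.],
                    "x2": ["x", "y", "z", "y", np.nan]},
                   index=list("abcde"))

imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputed = pd.DataFrame(imp.fit_transform(df1),
                       columns=df1.columns,
                       index=df1.index)

# Cast each column back to the dtype we want (assumed here: float and category).
restored = imputed.astype({"x1": "float64", "x2": "category"})
print(restored.dtypes)
```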
    
  • 2020-12-04 18:46

    Most of the time, you wouldn't want the same imputing strategy for all the columns. You may want the column mode for categorical variables and the column mean or median for numeric columns.

    For example:

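    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'num': [1., 2., 4., np.nan],
                       'cate1': ['a', 'a', 'b', np.nan],
                       'cate2': ['a', 'b', 'b', np.nan]})

    # numeric columns: pass the Series of means so each column gets its own mean
    # (a scalar here would also overwrite the NaNs in the categorical columns)
    >>> df.fillna(df.select_dtypes(include='number').mean(), inplace=True)

    # categorical columns: first mode of each column
    >>> df.fillna(df.select_dtypes(include='object').mode().iloc[0], inplace=True)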
    
    >>> print(df)
    
         num cate1 cate2
     0 1.000     a     a
     1 2.000     a     b
     2 4.000     b     b
     3 2.333     a     b
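
The per-dtype idea can also be wrapped in one function. Below is a rough sketch, loosely in the spirit of the na.roughfix option mentioned in the question; rough_fix is a hypothetical helper name, and the choice of median for numeric columns and first mode for everything else is an assumption you may want to change.

```python
import numpy as np
import pandas as pd


def rough_fix(df):
    """Hypothetical helper: median for numeric columns, first mode for the rest."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if not col.isna().any():
            continue  # nothing to impute in this column
        if pd.api.types.is_numeric_dtype(col):
            out[name] = col.fillna(col.median())
        else:
            modes = col.mode(dropna=True)
            if not modes.empty:  # an all-NaN column has no mode; leave it untouched
                out[name] = col.fillna(modes.iloc[0])
    return out


df = pd.DataFrame({"num": [1., 2., 4., np.nan],
                   "cate1": ["a", "a", "b", np.nan],
                   "cate2": pd.Categorical(["a", "b", "b", np.nan])})
fixed = rough_fix(df)
print(fixed)
```

The function works on a copy, so the original frame is left unchanged, and columns without missing values are skipped entirely.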
    