Impute categorical missing values in scikit-learn

清歌不尽 2020-11-30 16:55

I've got pandas data with some columns of text type. There are some NaN values along with these text columns. What I'm trying to do is to impute those NaN's by skle

10 Answers
  •  感动是毒
    2020-11-30 17:26

    To impute numeric columns with the mean and non-numeric columns with the most frequent value, you could do something like this. You could go further and distinguish between integers and floats; it might make sense to use the median for integer columns instead.

    import pandas as pd
    import numpy as np
    
    from sklearn.base import TransformerMixin
    
    class DataFrameImputer(TransformerMixin):
    
        def __init__(self):
            """Impute missing values.
    
            Columns of dtype object are imputed with the most frequent value 
            in column.
    
            Columns of other types are imputed with mean of column.
    
            """
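        def fit(self, X, y=None):
    
            # One fill value per column: the most frequent value for
            # object (text) columns, the mean for everything else.
            self.fill = pd.Series([X[c].value_counts().index[0]
                if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
                index=X.columns)
    
            return self
    
        def transform(self, X, y=None):
            # fillna matches the Series index against the column labels.
            return X.fillna(self.fill)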
    
    data = [
        ['a', 1, 2],
        ['b', 1, 1],
        ['b', 2, 2],
        [np.nan, np.nan, np.nan]
    ]
    
    X = pd.DataFrame(data)
    xt = DataFrameImputer().fit_transform(X)
    
    print('before...')
    print(X)
    print('after...')
    print(xt)
    

    This prints:

    before...
         0   1   2
    0    a   1   2
    1    b   1   1
    2    b   2   2
    3  NaN NaN NaN
    after...
       0         1         2
    0  a  1.000000  2.000000
    1  b  1.000000  1.000000
    2  b  2.000000  2.000000
    3  b  1.333333  1.666667
    
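    The suggestion above to use the median for integer columns could be sketched roughly as follows. This is only an illustration, not part of the class above: `fill_values` is a hypothetical helper, and since pandas stores an integer column that contains NaN as floats, it assumes a column counts as "integer" when all of its observed values are whole numbers. Columns with no observed values are skipped.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_values(X):
    """Pick one fill value per column (hypothetical helper, a sketch only)."""
    fill = {}
    for c in X.columns:
        observed = X[c].dropna()
        if observed.empty:
            # Nothing to learn from an all-missing column; leave it as is.
            continue
        if not is_numeric_dtype(X[c]):
            # Non-numeric column: most frequent value.
            fill[c] = observed.value_counts().index[0]
        elif (observed % 1 == 0).all():
            # Assumption: only whole numbers observed, treat as integer column.
            fill[c] = observed.median()
        else:
            # Genuinely fractional column: mean.
            fill[c] = observed.mean()
    return fill


data = [
    ['a', 1, 2.5],
    ['b', 1, 1.0],
    ['b', 2, 2.0],
    [np.nan, np.nan, np.nan],
]

X = pd.DataFrame(data)
fill = fill_values(X)
xt = X.fillna(fill)  # dict keys are matched against the column labels

print(fill)
print(xt)
```

    With this data, column 0 is filled with 'b', column 1 with the median 1.0, and column 2 with the mean of 2.5, 1.0 and 2.0.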
