Imputation of missing values for categories in pandas

时光说笑 2020-12-04 18:03

How can I fill NaNs in a categorical column of a pandas DataFrame with the most frequent level?

In R's randomForest package there is the na.roughfix option: A

4 Answers
  •  被撕碎了的回忆
    2020-12-04 18:35

    In recent versions of scikit-learn (0.20 and later), you can use SimpleImputer to impute both numeric and categorical columns:

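    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Mixed data: x1 is numeric, x2 holds string categories; each has one NaN
    arr = [[1., 'x'], [np.nan, 'y'], [7., 'z'], [7., 'y'], [4., np.nan]]
    df1 = pd.DataFrame({'x1': [x[0] for x in arr],
                        'x2': [x[1] for x in arr]},
                       index=list('abcde'))
    # 'most_frequent' replaces each NaN with the mode of its column
    imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    # fit_transform returns a NumPy array, so rebuild the DataFrame around it
    print(pd.DataFrame(imp.fit_transform(df1),
                       columns=df1.columns,
                       index=df1.index))
    #   x1 x2
    # a  1  x
    # b  7  y
    # c  7  z
    # d  7  y
    # e  4  y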
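
If you would rather stay in pandas, the same idea can be sketched with `Series.mode` and `Series.fillna`. This is only an illustration, not part of the answer above: the helper name `fill_with_mode` is made up for this example, ties are broken by taking the first mode pandas returns, and columns that are entirely NaN are left untouched.

```python
import numpy as np
import pandas as pd


def fill_with_mode(df, columns=None):
    """Return a copy of df with NaNs replaced by each column's most frequent value.

    Hypothetical helper for illustration; not a pandas or scikit-learn API.
    """
    out = df.copy()
    if columns is None:
        columns = out.columns
    for col in columns:
        modes = out[col].mode(dropna=True)  # may hold several values on ties
        if not modes.empty:                 # an all-NaN column has no mode
            out[col] = out[col].fillna(modes.iloc[0])
    return out


df = pd.DataFrame(
    {"x1": [1.0, np.nan, 7.0, 7.0, 4.0],
     "x2": pd.Categorical(["x", "y", "z", "y", None])},
    index=list("abcde"),
)
filled = fill_with_mode(df)
print(filled)
print(filled.dtypes)
```

One difference worth noting: SimpleImputer returns a NumPy array, so column dtypes such as `category` are not carried over, whereas filling column by column in pandas keeps each column's dtype.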
    
