Imputing missing values using sklearn IterativeImputer class for MICE

Posted by …衆ロ難τιáo~ on 2021-02-08 04:57:29


I'm trying to learn how to use MICE to impute missing values in my datasets. I've heard about fancyimpute's MICE, but I also read that sklearn's IterativeImputer class can accomplish similar results. From sklearn's docs:

Our implementation of IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations) [1], but differs from it by returning a single imputation instead of multiple imputations. However, IterativeImputer can also be used for multiple imputations by applying it repeatedly to the same dataset with different random seeds when sample_posterior=True

I've seen "seeds" being used in different pipelines, but I never understood them well enough to use them in my own code. Could anyone explain seeds and provide an example of how to set them for a MICE imputation using sklearn's IterativeImputer? Thanks!


IterativeImputer's results depend on a random state. The value used to initialize that random state is called a "seed": the same seed reproduces the same results, and a different seed gives a different random draw.
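To make the role of the seed concrete, here is a minimal sketch that imputes the same array twice with one seed and once with another. The data and the helper name `impute` are made up for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Small illustrative dataset (assumed values, not from any real source).
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 3.0],
              [7.0, np.nan]])


def impute(seed):
    """Hypothetical helper: fit and impute X with posterior sampling."""
    imp = IterativeImputer(max_iter=10, random_state=seed, sample_posterior=True)
    return imp.fit_transform(X)


same_a = impute(0)
same_b = impute(0)
other = impute(1)

# Same seed: the two imputations should match exactly.
print(np.allclose(same_a, same_b))
# Different seed: the imputed cells will generally differ.
print(np.allclose(same_a, other))
```

Fixing the seed is what makes a run reproducible; varying it is what produces distinct imputations.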

As the documentation states, we can obtain multiple imputations by setting sample_posterior to True and varying the random seed, i.e. the random_state parameter.

Here is an example of how to use it:

import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

X_train = [[1, 2],
           [3, 6],
           [4, 8],
           [np.nan, 3],
           [7, np.nan]]
X_test = [[np.nan, 2],
          [np.nan, np.nan],
          [np.nan, 6]]

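for i in range(3):
    # A different random_state (seed) per run gives a different posterior draw.
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    imp.fit(X_train)
    print(f"imputation {i}:")
    # Round so the printed values are easy to compare.
    print(np.round(imp.transform(X_test)))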

It outputs:

imputation 0:
[[ 1.  2.]
 [ 5. 10.]
 [ 3.  6.]]
imputation 1:
[[1. 2.]
 [0. 1.]
 [3. 6.]]
imputation 2:
[[1. 2.]
 [1. 2.]
 [3. 6.]]

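We can observe three different imputations: the observed values (2 and 6) are left unchanged, and after rounding the runs differ in the second row, where both values were missing.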
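If the separate imputations need to be compared or summarized, one option is to stack them into a single array. The sketch below is only an illustration under assumptions: the number of imputations (`n_imputations = 5`) is an arbitrary choice, and the per-cell mean and standard deviation are a simple summary, not the pooling procedure of the R MICE package.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

X_train = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]], dtype=float)
X_test = np.array([[np.nan, 2], [np.nan, np.nan], [np.nan, 6]], dtype=float)

n_imputations = 5  # assumed value, chosen only for this sketch

imputations = []
for seed in range(n_imputations):
    imp = IterativeImputer(max_iter=10, random_state=seed, sample_posterior=True)
    imp.fit(X_train)
    imputations.append(imp.transform(X_test))

# Shape: (n_imputations, n_rows, n_cols)
stacked = np.stack(imputations)

# Per-cell summary across imputations; observed cells have zero spread.
mean_imputation = stacked.mean(axis=0)
std_imputation = stacked.std(axis=0)

print(mean_imputation)
print(std_imputation)
```

The spread across imputations gives a rough sense of how uncertain each imputed value is, which a single imputation cannot show.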

