I\'ve got pandas data with some columns of text type. There are some NaN values along with these text columns. What I\'m trying to do is to impute those NaN\'s by skle
strategy = 'most_frequent' can be used only with quantitative feature, not with qualitative. This custom impuer can be used for both qualitative and quantitative. Also with scikit learn imputer either we can use it for whole data frame(if all features are quantitative) or we can use 'for loop' with list of similar type of features/columns(see the below example). But custom imputer can be used with any combinations.
from sklearn.preprocessing import Imputer
impute = Imputer(strategy='mean')
for cols in ['quantitative_column', 'quant']: # here both are quantitative features.
xx[cols] = impute.fit_transform(xx[[cols]])
Custom Imputer :
from sklearn.preprocessing import Imputer
from sklearn.base import TransformerMixin
class CustomImputer(TransformerMixin):
def __init__(self, cols=None, strategy='mean'):
self.cols = cols
self.strategy = strategy
def transform(self, df):
X = df.copy()
impute = Imputer(strategy=self.strategy)
if self.cols == None:
self.cols = list(X.columns)
for col in self.cols:
if X[col].dtype == np.dtype('O') :
X[col].fillna(X[col].value_counts().index[0], inplace=True)
else : X[col] = impute.fit_transform(X[[col]])
return X
def fit(self, *_):
return self
Dataframe:
X = pd.DataFrame({'city':['tokyo', np.NaN, 'london', 'seattle', 'san
francisco', 'tokyo'],
'boolean':['yes', 'no', np.NaN, 'no', 'no', 'yes'],
'ordinal_column':['somewhat like', 'like', 'somewhat like', 'like',
'somewhat like', 'dislike'],
'quantitative_column':[1, 11, -.5, 10, np.NaN, 20]})
city boolean ordinal_column quantitative_column
0 tokyo yes somewhat like 1.0
1 NaN no like 11.0
2 london NaN somewhat like -0.5
3 seattle no like 10.0
4 san francisco no somewhat like NaN
5 tokyo yes dislike 20.0
1) Can be used with list of similar type of features.
cci = CustomImputer(cols=['city', 'boolean']) # here default strategy = mean
cci.fit_transform(X)
can be used with strategy = median
sd = CustomImputer(['quantitative_column'], strategy = 'median')
sd.fit_transform(X)
3) Can be used with whole data frame, it will use default mean(or we can also change it with median. for qualitative features it uses strategy = 'most_frequent' and for quantitative mean/median.
call = CustomImputer()
call.fit_transform(X)