I've got pandas data with some columns of text type. There are some NaN values in these text columns. What I'm trying to do is to impute those NaN's by skle
You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:
First, following the book Hands-On Machine Learning with Scikit-Learn and TensorFlow, you can build separate subpipelines for numerical and string/categorical features. Each subpipeline's first transformer is a selector that takes a list of column names, so full_pipeline.fit_transform() can take a pandas DataFrame directly:
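    from sklearn.base import BaseEstimator, TransformerMixin

    class DataFrameSelector(BaseEstimator, TransformerMixin):
        def __init__(self, attribute_names):
            self.attribute_names = attribute_names

        def fit(self, X, y=None):
            # Nothing to learn; the selector is stateless
            return self

        def transform(self, X):
            # Return the chosen DataFrame columns as a NumPy array
            return X[self.attribute_names].values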
You can then combine these subpipelines with sklearn.pipeline.FeatureUnion, for example:
    full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline)
    ])
Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer() (or sklearn.impute.SimpleImputer() in scikit-learn 0.20 and later, where Imputer is deprecated), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.
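Putting the pieces together, here is a minimal sketch of the whole setup. It is an illustration under a few assumptions, not the book's exact code: the toy DataFrame and its column names (age, city) are made up, and SimpleImputer(strategy="most_frequent") stands in for CategoricalImputer so that the example depends only on numpy, pandas and scikit-learn 0.20 or later.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select DataFrame columns by name and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Hypothetical toy data: one numeric and one text column, each with a NaN.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "city": ["Paris", "Rome", np.nan, "Paris"],
})

# Numeric columns: fill NaN with the column median.
num_pipeline = Pipeline([
    ("selector", DataFrameSelector(["age"])),
    ("imputer", SimpleImputer(strategy="median")),
])

# Text columns: fill NaN with the most frequent value
# (SimpleImputer used here as a stand-in for CategoricalImputer).
cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(["city"])),
    ("imputer", SimpleImputer(strategy="most_frequent")),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

result = full_pipeline.fit_transform(df)
print(result)
```

The result is a single object-dtype array with the imputed numeric column followed by the imputed text column. As far as I know, sklearn-pandas 2.0 dropped CategoricalImputer in favour of SimpleImputer, so on newer installs this stand-in may be the only option anyway.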
Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.
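If you only want to see what most-frequent imputation does to a text column (which, as I understand it, is the default behaviour of CategoricalImputer), the idea can be sketched in plain pandas. The helper name impute_most_frequent and the sample data are hypothetical and only there for illustration:

```python
import numpy as np
import pandas as pd


def impute_most_frequent(series):
    """Return a copy of `series` with NaN replaced by its most frequent value."""
    modes = series.mode(dropna=True)  # sorted; may hold several values on a tie
    if modes.empty:
        # Column is entirely NaN: nothing sensible to fill with
        return series.copy()
    return series.fillna(modes.iloc[0])


# Hypothetical text column with one missing entry
colors = pd.Series(["red", np.nan, "blue", "red"])
filled = impute_most_frequent(colors)
print(filled.tolist())
```

On a tie, Series.mode() returns the candidates sorted, so this sketch picks the first one in sort order; a real imputer may break ties differently.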