I am totally new to Machine Learning and I have been working with unsupervised learning technique.
Image shows my sample Data(After all Cleaning) Screenshot : Sample
To perform one-hot encoding for multiple categorical features, we can create a new class which customizes our own multiple categorical features binarizer and plug it into categorical pipeline as follows.
Suppose CAT_FEATURES = ['cat_feature1', 'cat_feature2'] is a list of categorical features. The following scripts shall resolve the issue and produce what we want.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
"""Perform one-hot encoding to categorical features."""
def __init__(self, cat_features):
self.cat_features = cat_features
def fit(self, X_cat, y=None):
return self
def transform(self, X_cat):
X_cat_df = pd.DataFrame(X_cat, columns=self.cat_features)
X_onehot_df = pd.get_dummies(X_cat_df, columns=self.cat_features)
return X_onehot_df.values
# Pipeline for categorical features.
cat_pipeline = Pipeline([
('selector', DataFrameSelector(CAT_FEATURES)),
('onehot_encoder', CustomLabelBinarizer(CAT_FEATURES))
])