Label encoding across multiple columns in scikit-learn

礼貌的吻别 2020-11-22 09:02

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to a

22 answers
  •  慢半拍i (original poster)
    2020-11-22 09:49

    Following up on the comments on @PriceHardman's solution, I would propose the following version of the class:

    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.exceptions import NotFittedError
    from sklearn.preprocessing import LabelEncoder

    # NOTE: ``pdu`` appears to be a validation helper module of the answer's
    # author (not shown here); it is not part of pandas or scikit-learn.


    class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
        def __init__(self, cols=None):
            pdu._is_cols_input_valid(cols)
            self.cols = cols
            # One independent LabelEncoder per column.
            self.les = {col: LabelEncoder() for col in cols}
            self._is_fitted = False

        def transform(self, df, **transform_params):
            """
            Label-encode ``cols`` of ``df`` using the fitted encoders

            Parameters
            ----------
            df : DataFrame
                DataFrame to be preprocessed
            """
            if not self._is_fitted:
                raise NotFittedError("Fitting was not performed")
            pdu._is_cols_subset_of_df_cols(self.cols, df)

            # Work on a copy so the caller's DataFrame is left untouched.
            df = df.copy()

            label_enc_dict = {}
            for col in self.cols:
                label_enc_dict[col] = self.les[col].transform(df[col])

            labelenc_cols = pd.DataFrame(label_enc_dict,
                # The index of the resulting DataFrame should be assigned and
                # equal to the one of the original DataFrame. Otherwise, upon
                # concatenation NaNs will be introduced.
                index=df.index
            )

            for col in self.cols:
                df[col] = labelenc_cols[col]
            return df

        def fit(self, df, y=None, **fit_params):
            """
            Fit one label encoder per column in ``cols``

            Parameters
            ----------
            df : DataFrame
                Data to use for fitting.
                In many cases, should be ``X_train``.
            """
            pdu._is_cols_subset_of_df_cols(self.cols, df)
            for col in self.cols:
                self.les[col].fit(df[col])
            self._is_fitted = True
            return self

    This class fits one encoder per column on the training set and reuses those fitted encoders when transforming. An initial version of the code can be found here.
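
    The fit-on-train, transform-on-new-data idea behind the class can be sketched without the helper module. This is only a minimal illustration, not the class itself: the toy DataFrames, the column names and the `encoders` dict are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; column names and values are invented for illustration.
train = pd.DataFrame({"city": ["Paris", "Oslo", "Paris"],
                      "size": ["S", "L", "M"]})
test = pd.DataFrame({"city": ["Oslo", "Paris"],
                     "size": ["M", "S"]}, index=[10, 11])

cols = ["city", "size"]

# Fit one encoder per column, using the training data only.
encoders = {col: LabelEncoder().fit(train[col]) for col in cols}

# Reuse the fitted encoders on new data; copy first so `test` stays unchanged.
encoded = test.copy()
for col in cols:
    encoded[col] = encoders[col].transform(test[col])

print(encoded)
```

    Each encoder assigns integer codes in the sorted order of the labels seen during fitting, so a label that appears only in the new data makes `transform` raise a `ValueError`. That is one reason to fit on a training set that covers all expected labels.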
