Label encoding across multiple columns in scikit-learn

礼貌的吻别 2020-11-22 09:02

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to a

22 answers
  •  小蘑菇 (original poster)
    2020-11-22 09:39

    I checked the source code of LabelEncoder (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py). It is built on a few NumPy operations, one of which is np.unique(), and the encoder only accepts 1-D array input, so it has to be applied one column at a time (correct me if I am wrong).
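    To illustrate the 1-D restriction, here is a minimal sketch on a made-up two-column dataframe (the column names and values are assumptions for illustration only). np.unique() on its own flattens a 2-D array rather than rejecting it, while LabelEncoder raises a ValueError when given more than one column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, not from the original post
df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["S", "M", "S"]})

# np.unique flattens 2-D input and returns the sorted unique values
flat_unique = np.unique(df.to_numpy())

# LabelEncoder validates its input as 1-D and rejects a two-column frame
try:
    LabelEncoder().fit(df)
    accepted_2d = True
except ValueError:
    accepted_2d = False

# A single column (1-D) is encoded without complaint
codes = LabelEncoder().fit_transform(df["color"])

print(list(flat_unique), accepted_2d, list(codes))
```

    This is why the approach below fits a separate encoder for every column.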

    A very rough idea: first identify which columns need a LabelEncoder, then loop over them and encode each column separately.

    import pandas as pd


    def cat_var(df):
        """Identify categorical (object-dtype) features.

        Parameters
        ----------
        df : original dataframe, after missing values have been handled

        Returns
        -------
        cat_var_df : summary dataframe with the column index and column name
            of every categorical variable
        """
        # Positions of all columns whose dtype is object (i.e. strings)
        cat_var_index = [i for i, dtype in enumerate(df.dtypes) if dtype == 'object']
        cat_var_name = [df.columns[i] for i in cat_var_index]

        cat_var_df = pd.DataFrame({'cat_ind': cat_var_index,
                                   'cat_name': cat_var_name})

        return cat_var_df
    
    
    
    from sklearn.preprocessing import LabelEncoder 
    
    def column_encoder(df, cat_var_list=None):
        """Encode the categorical features of a dataframe.

        Parameters
        ----------
        df : input dataframe (modified in place)
        cat_var_list : categorical feature index and name, from the cat_var
            function; computed from df when omitted

        Returns
        -------
        df : the same dataframe with its categorical features encoded
        label_list : classes_ attribute of every encoded feature, in column order
        """
        if cat_var_list is None:
            cat_var_list = cat_var(df)

        label_list = []
        cat_list = cat_var_list.loc[:, 'cat_name']

        for cat_feature in cat_list:
            # One encoder per column, since LabelEncoder works on 1-D input
            le = LabelEncoder()
            le.fit(df[cat_feature])
            label_list.append(list(le.classes_))

            # Assign the whole column so it takes the integer dtype
            df[cat_feature] = le.transform(df[cat_feature])

        return df, label_list
    

    The returned df is the dataframe after encoding, and label_list shows what the integer values mean in each corresponding column: the position of a label in the list is its code. This is a snippet from a data-processing script I wrote for work. Let me know if you see room for improvement.
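    As a rough sketch of how the codes can be mapped back to labels, the example below uses a variation on the idea above: it keeps the fitted encoders in a dict instead of only their classes_. The dataframe, the column list `cat_cols` and the `encoders` dict are hypothetical names chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, not from the original post
df = pd.DataFrame({"city": ["Paris", "Oslo", "Paris"],
                   "fruit": ["kiwi", "apple", "fig"],
                   "count": [3, 1, 2]})
cat_cols = ["city", "fruit"]  # assumed to be the string-valued columns

encoders = {}          # column name -> fitted LabelEncoder
encoded = df.copy()    # leave the original dataframe untouched
for col in cat_cols:
    le = LabelEncoder()
    encoded[col] = le.fit_transform(df[col])
    encoders[col] = le

# classes_ is sorted, so a label's position in it is its integer code
city_labels = list(encoders["city"].classes_)

# inverse_transform turns the integer codes back into the original labels
decoded_fruit = list(encoders["fruit"].inverse_transform(encoded["fruit"]))

print(encoded)
print(city_labels, decoded_fruit)
```

    Keeping the encoder objects also allows the same mapping to be reused on new data with transform, as long as that data contains no unseen labels.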

    EDIT: Just want to mention that the methods above work best on a dataframe with no missing values. I am not sure how they behave on a dataframe that contains missing data (I ran a missing-value handling step before executing the methods above).
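    One possible way to handle missing values before encoding, sketched under the assumption that treating them as a category of their own is acceptable for the task. The placeholder string `"__missing__"` and the toy column are hypothetical; this is not necessarily the procedure used in the original script.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with one missing entry
df = pd.DataFrame({"grade": ["a", np.nan, "b", "a"]})

# Assumption: a missing value becomes its own explicit category
filled = df["grade"].fillna("__missing__")

le = LabelEncoder()
codes = le.fit_transform(filled)

# The placeholder shows up in classes_ like any other label
print(list(le.classes_), list(codes))
```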
