Sklearn Label Encoding multiple columns pandas dataframe

春和景丽 2020-12-10 03:25

I am trying to encode a number of columns containing categorical data ("Yes" and "No") in a large pandas dataframe. The complete dataframe contains

5 Answers
  • 2020-12-10 03:54

    Scikit-learn has something for this now: OrdinalEncoder

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    data = pd.DataFrame({'A': [1, 2, 3, 4],
                         'B': ["Yes", "No", "Yes", "Yes"],
                         'C': ["Yes", "No", "No", "Yes"],
                         'D': ["No", "Yes", "No", "Yes"]})
    
    oe = OrdinalEncoder()
    
    t_data = oe.fit_transform(data)
    print(t_data)
    # [[0. 1. 1. 0.]
    # [1. 0. 0. 1.]
    # [2. 1. 0. 0.]
    # [3. 1. 1. 1.]]
    

    Works straight out of the box.
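
    If only some columns should be encoded, the encoder can be fitted on a subset. The sketch below assumes the same example frame; the names cat_cols, codes and encoded are illustrative, not part of the answer. It also shows that OrdinalEncoder keeps the learned categories per column in categories_ and can map the codes back with inverse_transform.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})

cat_cols = ['B', 'C', 'D']  # illustrative choice: only the Yes/No columns

oe = OrdinalEncoder()
# Fit and encode only the categorical columns
codes = pd.DataFrame(oe.fit_transform(data[cat_cols]),
                     columns=cat_cols, index=data.index)

# Column A is kept unchanged next to the encoded columns
encoded = pd.concat([data[['A']], codes], axis=1)

# categories_ holds one array of learned categories per encoded column
categories = {col: list(cats) for col, cats in zip(cat_cols, oe.categories_)}

# inverse_transform maps the numeric codes back to the original labels
restored = oe.inverse_transform(codes)

print(encoded)
print(categories)
```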

  • 2020-12-10 03:59

    As the following code shows, you can encode multiple columns at once by applying LabelEncoder to the DataFrame. However, please note that the same encoder is refit on every column, so afterwards its classes_ attribute only describes the last column, not all of them.

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    
    df = pd.DataFrame({'A': [1, 2, 3, 4],
                       'B': ["Yes", "No", "Yes", "Yes"],
                       'C': ["Yes", "No", "No", "Yes"],
                       'D': ["No", "Yes", "No", "Yes"]})
    print(df)
    #    A    B    C    D
    # 0  1  Yes  Yes   No
    # 1  2   No   No  Yes
    # 2  3  Yes   No   No
    # 3  4  Yes  Yes  Yes
    
    # LabelEncoder
    le = LabelEncoder()
    
    # apply "le.fit_transform"
    df_encoded = df.apply(le.fit_transform)
    print(df_encoded)
    #    A  B  C  D
    # 0  0  1  1  0
    # 1  1  0  0  1
    # 2  2  1  0  0
    # 3  3  1  1  1
    
    # Note: le was refit on every column, so classes_ only reflects the last one (D).
    print(le.classes_)
    # ['No' 'Yes']
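
    One way around this limitation is to keep a separate encoder per column. This is a minimal sketch, not part of the answer above; the dictionary name encoders is an assumption made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})

encoders = {}  # illustrative name: column name -> fitted LabelEncoder
df_encoded = df.copy()
for col in df.columns:
    le = LabelEncoder()  # a fresh encoder for every column
    df_encoded[col] = le.fit_transform(df[col])
    encoders[col] = le

# The classes of every column remain available
classes = {col: list(le.classes_) for col, le in encoders.items()}

# Any single column can be decoded again with its own encoder
original_b = encoders['B'].inverse_transform(df_encoded['B'])

print(classes)
print(list(original_b))
```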
    
  • 2020-12-10 03:59

    You can also loop through the columns you want to encode. This method might not be the most efficient, but it works fine.

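    from sklearn import preprocessing

    # df and test_data are assumed to be existing DataFrames with the same columns
    LE = preprocessing.LabelEncoder()
    for col in df.columns:
        # fit() returns the encoder itself, so its result must not be assigned to the column
        LE.fit(df[col])
        df[col] = LE.transform(df[col])
        # transform() raises a ValueError if test_data contains labels not seen in df
        test_data[col] = LE.transform(test_data[col])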
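
    If the test data may contain labels that never occur in the training data, LabelEncoder.transform fails. A possible alternative, sketched below, swaps in OrdinalEncoder with handle_unknown="use_encoded_value", which assumes scikit-learn 0.24 or later. The two small frames and the placeholder code -1 are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test frames; "Maybe" only occurs in the test data
df = pd.DataFrame({'B': ["Yes", "No", "Yes"],
                   'C': ["No", "No", "Yes"]})
test_data = pd.DataFrame({'B': ["No", "Maybe"],
                          'C': ["Yes", "No"]})

# Labels not seen during fit are mapped to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_codes = enc.fit_transform(df)    # learn the categories on the training data
test_codes = enc.transform(test_data)  # reuse them for the test data

print(train_codes)
print(test_codes)
```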
    
  • 2020-12-10 04:15

    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import LabelBinarizer

    # df is the pandas dataframe
    class preprocessing(BaseEstimator, TransformerMixin):
        """Finds the categorical columns in a dataframe and one-hot encodes
        them. You can replace LabelBinarizer with LabelEncoder if you only
        require label encoding. transform returns the numerical data and the
        encoded categorical data as a single dataframe."""

        def __init__(self, df):
            self.datatypes = df.dtypes.astype(str)
            self.catcolumns = []
            self.cat_encoders = []

        def fit(self, df, y=None):
            # Collect the object-typed columns and fit one encoder per column;
            # both lists are rebuilt so that fit can be called more than once
            self.catcolumns = [ix for ix, val in self.datatypes.items()
                               if val == 'object']
            self.cat_encoders = []
            for name in self.catcolumns:
                encs = LabelBinarizer()
                encs.fit(df[name])
                self.cat_encoders.append((name, encs))
            return self

        def transform(self, df, y=None):
            # Build the result locally so that transform can be called repeatedly
            encoded = [pd.DataFrame(encs.transform(df[name]), index=df.index)
                       for name, encs in self.cat_encoders]
            df_num = df.drop(self.catcolumns, axis=1)
            # Return the concatenated frame's .values instead to use it in a
            # scikit-learn pipeline
            return pd.concat([df_num] + encoded, axis=1, ignore_index=True)
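
    A comparable result can probably be had from scikit-learn's built-in pieces. The sketch below is not the class above: it swaps in ColumnTransformer with OneHotEncoder, assumes the example frame from the other answers, and uses an illustrative cat_cols list. With drop="if_binary", a two-valued column becomes a single 0/1 column, which is also what LabelBinarizer produces for binary data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})

cat_cols = ['B', 'C', 'D']  # illustrative: the categorical columns to encode

ct = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(drop="if_binary"), cat_cols)],
    remainder="passthrough",  # keep the numerical column A unchanged
    sparse_threshold=0,       # always return a dense NumPy array
)

# Encoded categorical columns come first, passthrough columns last
X = ct.fit_transform(df)
print(X)
```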
    
  • 2020-12-10 04:18

    First, find out all the features with type object:

    objList = df.select_dtypes(include="object").columns
    print(objList)
    

    Now, to convert the objList features above to a numeric type, you can use a for loop as shown below:

    #Label Encoding for object to numeric conversion
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    
    for feat in objList:
        df[feat] = le.fit_transform(df[feat].astype(str))
    
    df.info()  # info() prints its summary itself and returns None
    

    Note that the loop explicitly casts each column to string with astype(str); if you remove the cast, LabelEncoder can throw an error, for example when a column mixes strings with missing values (NaN).
