I\'m trying to use scikit-learn\'s LabelEncoder
to encode a pandas DataFrame
of string labels. As the dataframe has many (50+) columns, I want to a
I checked the source code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py) of LabelEncoder. It was based on a set of numpy transformation, which one of those is np.unique(). And this function only takes 1-d array input. (correct me if I am wrong).
Very Rough ideas... first, identify which columns needed LabelEncoder, then loop through each column.
def cat_var(df):
"""Identify categorical features.
Parameters
----------
df: original df after missing operations
Returns
-------
cat_var_df: summary df with col index and col name for all categorical vars
"""
col_type = df.dtypes
col_names = list(df)
cat_var_index = [i for i, x in enumerate(col_type) if x=='object']
cat_var_name = [x for i, x in enumerate(col_names) if i in cat_var_index]
cat_var_df = pd.DataFrame({'cat_ind': cat_var_index,
'cat_name': cat_var_name})
return cat_var_df
from sklearn.preprocessing import LabelEncoder
def column_encoder(df, cat_var_list):
"""Encoding categorical feature in the dataframe
Parameters
----------
df: input dataframe
cat_var_list: categorical feature index and name, from cat_var function
Return
------
df: new dataframe where categorical features are encoded
label_list: classes_ attribute for all encoded features
"""
label_list = []
cat_var_df = cat_var(df)
cat_list = cat_var_df.loc[:, 'cat_name']
for index, cat_feature in enumerate(cat_list):
le = LabelEncoder()
le.fit(df.loc[:, cat_feature])
label_list.append(list(le.classes_))
df.loc[:, cat_feature] = le.transform(df.loc[:, cat_feature])
return df, label_list
The returned df would be the one after encoding, and label_list will show you what all those values means in the corresponding column. This is a snippet from a data process script I wrote for work. Let me know if you think there could be any further improvement.
EDIT: Just want to mention here that the methods above work with data frame with no missing the best. Not sure how it is working toward data frame contains missing data. (I had a deal with missing procedure before execute above methods)