How to keep track of columns after encoding categorical variables?

Submitted by 坚强是说给别人听的谎言 on 2021-01-28 10:54:48

Question


I am wondering how I can keep track of the original columns of a dataset once I perform data preprocessing on it?

In the code below, df_columns tells me that column 0 in df_array is A, column 1 is B, and so forth.

However, once I encode the categorical column B, df_columns is no longer valid for keeping track of the columns in df_dummies.

import pandas as pd
import numpy as np

animal = ['dog','cat','horse']

df = pd.DataFrame({'A': np.random.rand(9),
                   'B': [animal[np.random.randint(3)] for i in range(9)],
                   'C': np.random.rand(9),
                   'D': np.random.rand(9)})

df_array = df.values
df_columns = df.columns

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder='passthrough')
# np.float was removed in NumPy 1.24; the builtin float is the equivalent dtype
df_dummies = np.array(ct.fit_transform(df_array), dtype=float)

The solution should be agnostic of the position of the categorical column, be it A, B, C or D. I could do the grunt work and keep updating df_columns by hand, but that wouldn't be elegant or "pythonic".

Furthermore, how would the solution keep track of what the encoded categoricals mean? For example, {0,0,1} would be cat, {0,1,0} would be dog, and so on.

PS - I am aware of the dummy variable trap and will take df_dummies[:,1:] when I actually use it to train my model.


Answer 1:


Can you confirm whether future data sets will continue to have the same column names? If I understood your question correctly, all you need to do is save df_columns from the original data frame and use it to select and order the columns of your new dataframe.

new_df_reindexed = new_df[df_columns]

To answer your other questions, you can one-hot encode your data using get_dummies() from pandas. Set the drop_first parameter to True to drop the first dummy column generated for each categorical variable and so avoid the dummy variable trap. Also, save the column list of the one-hot-encoded data frame.
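As a minimal sketch of that step (the frame train_df and its values are made up for illustration; only get_dummies(), drop_first and the saved column list come from the text above):

```python
import pandas as pd

# Hypothetical training frame; column names mirror the question's example.
train_df = pd.DataFrame({
    "A": [0.1, 0.2, 0.3, 0.4],
    "B": ["dog", "cat", "horse", "cat"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Encode only column B. drop_first=True drops the first level of B
# (levels are sorted, so "cat" is the one dropped here).
train_encoded = pd.get_dummies(train_df, columns=["B"], drop_first=True)

# Save the encoded column list so later data can be aligned to it.
train_one_hot_encode_col_list = train_encoded.columns.tolist()
print(train_one_hot_encode_col_list)
```

The dummy columns are named after the source column and the category value (B_dog, B_horse), so the column names themselves record what each encoded column means, wherever B sat in the original frame.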

To ensure that your new / testing / holdout data set has the same column definition as the one used in model training:

  • First use get_dummies() to one-hot encode the new data set.
  • Use pandas reindex to bring the new dataframe into the same structure as the one used in model training - test_df_reindexed = test_df_onehotencode.reindex(columns=train_one_hot_encode_col_list, fill_value=0). Pass the labels through columns= on its own; pandas raises an error if columns= and axis= are given together.
  • This creates dummy variable columns for categorical values that occur in the training data set but not in the new data set. With fill_value=0 they are filled with zeros; without it they would be filled with NaN.
  • The same reindex call also removes any columns in the new data set that are not present in the training data set. Plain selection, test_df_onehotencode[train_one_hot_encode_col_list], achieves this only when every training column already exists in the new data; otherwise it raises a KeyError.
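The steps above can be sketched as follows. The two small frames are hypothetical, chosen so that the new data lacks some training categories ("cat", "horse") and contains one unseen category ("zebra"):

```python
import pandas as pd

# Hypothetical training and new data sets.
train_df = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": ["dog", "cat", "horse"]})
test_df = pd.DataFrame({"A": [0.5, 0.6], "B": ["dog", "zebra"]})

# Column list saved at training time.
train_one_hot_encode_col_list = pd.get_dummies(train_df, columns=["B"]).columns.tolist()

# Encode the new data on its own; its columns differ from the training ones.
test_df_onehotencode = pd.get_dummies(test_df, columns=["B"])

# Align to the training layout: missing dummies are added as zeros,
# and the unseen B_zebra column is dropped.
test_df_reindexed = test_df_onehotencode.reindex(
    columns=train_one_hot_encode_col_list, fill_value=0
)
print(test_df_reindexed.columns.tolist())
```

A row whose category was never seen in training (the "zebra" row here) ends up with zeros in every B_ column, which is worth keeping in mind when interpreting model output for such rows.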

If you follow these steps, you can completely rely on the list of original column names, and will not need to track column positions or categorical value definitions.

I would also advise you to read the following for further reference:

  • One-hot encoding in pandas - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
  • Column re-indexing - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html



Source: https://stackoverflow.com/questions/60103882/how-to-keep-track-of-columns-after-encoding-categorical-variables
