I am trying to apply both imputation and one-hot encoding on my data set. I know that on applying imputation the dimension of the data might change, so I took care of it manually.
I've been struggling with a similar problem and I've found an approach that might help in this situation.
The main idea is to convert the column to the categorical dtype while you are still working with the complete dataset. Something like this:
dataframe[column] = dataframe[column].astype('category')
When you do that, the column stores the full set of available categories. Later, when you perform a train/test split, those categories are preserved even if some values do not appear in one of the resulting datasets.
Pandas' get_dummies function uses the column's categories to perform the encoding. Since the set of categories is stable, you will always get the same number of columns after encoding.
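Here is a minimal sketch of the idea (the column name and values are made up for illustration): a category that only appears in one part of the split still gets its own dummy column in both parts.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# Store the full set of categories on the column before splitting.
df["color"] = df["color"].astype("category")

# Simulate a train/test split where "blue" appears only in the test part.
train = df.iloc[:2]  # rows with "red", "green"
test = df.iloc[2:]   # rows with "blue", "red"

# Both encodings get a column for every category, in the same order.
train_enc = pd.get_dummies(train["color"])
test_enc = pd.get_dummies(test["color"])
print(list(train_enc.columns))  # ['blue', 'green', 'red']
print(list(test_enc.columns))   # ['blue', 'green', 'red']
```

Note that slicing a categorical column keeps the full category list, which is what makes the two encodings line up.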
I'm still exploring this solution myself. Keep in mind that you can also manipulate the categories directly if you need to, with something like this (note that set_categories returns a new Series rather than modifying the column in place):
dataframe[column] = dataframe[column].cat.set_categories([.....])
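For example, you could register a category that never occurs in the data at all, so the encoding still reserves a column for it (the values here are hypothetical):

```python
import pandas as pd

s = pd.Series(["red", "green"], dtype="category")

# Explicitly set the categories, including "yellow", which is absent
# from the data; set_categories returns a new Series.
s = s.cat.set_categories(["red", "green", "yellow"])

# get_dummies follows the category order, absent categories included.
print(pd.get_dummies(s).columns.tolist())  # ['red', 'green', 'yellow']
```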