One-Hot Encoding in Scikit-learn for only part of the DataFrame

Posted by 自作多情 on 2019-12-11 07:49:13

Question


I am trying to use a decision tree classifier on my data, which looks very similar to the data in this tutorial: https://www.ritchieng.com/machinelearning-one-hot-encoding/

The tutorial then goes on to convert the strings into numeric data:

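import pandas as pd
from sklearn import preprocessing

X = pd.read_csv('titanic_data.csv')
X = X.select_dtypes(include=[object])  # keep only the string (object) columns
le = preprocessing.LabelEncoder()
X_2 = X.apply(le.fit_transform)  # integer-encode each remaining column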

This leaves the DataFrame looking like this:

After this, the data is put through the OneHotEncoder, and I assume it can then be split and passed into a decision tree classifier fairly easily.

The problem is that, as far as I can tell, the original numeric data gets lost during this encoding process. How can I keep the numeric data that was removed, or add it back in later? Thanks!
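A small sketch of where the numeric columns go may help here. It assumes a tiny stand-in DataFrame (the column names `age`, `sex`, and `embarked` are illustrative, not taken from the tutorial's file) and shows that the loss comes from the `select_dtypes` filter rather than from the encoder itself. Setting the numeric columns aside and concatenating them back is one possible way to recover them:

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for titanic_data.csv (the real file is not shown here).
df = pd.DataFrame({
    'age': [24, 25, 68],
    'sex': ['female', 'male', 'male'],
    'embarked': ['S', 'C', 'S'],
})
# Force object dtype so the sketch behaves the same across pandas versions
# (newer versions may infer a dedicated string dtype instead).
df = df.astype({'sex': object, 'embarked': object})

# include=[object] keeps only the string columns, so 'age' is dropped
# at this step, before any encoding happens.
categorical = df.select_dtypes(include=[object])
numeric = df.select_dtypes(exclude=[object])

# Integer-encode each string column with its own encoder.
encoded = categorical.apply(
    lambda col: preprocessing.LabelEncoder().fit_transform(col)
)

# Both pieces share the original index, so they can be joined column-wise.
combined = pd.concat([numeric, encoded], axis=1)
print(combined)
```

Here `combined` holds the untouched `age` column next to the encoded string columns.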


Answer 1:


Actually, there is a really simple solution: use pd.get_dummies().

If you have a DataFrame like the following:

import pandas as pd

so_data = {
    'passenger_id': [1, 2, 3, 4, 5],
    'survived': [1, 0, 0, 1, 0],
    'age': [24, 25, 68, 39, 5],
    'sex': ['female', 'male', 'male', 'female', 'female'],
    'first_name': ['Joanne', 'Mark', 'Josh', 'Petka', 'Ariel']
}
so_df = pd.DataFrame(so_data)

which looks like:

   passenger_id  survived  age     sex first_name
0             1         1   24  female     Joanne
1             2         0   25    male       Mark
2             3         0   68    male       Josh
3             4         1   39  female      Petka
4             5         0    5  female      Ariel

You can just do:

pd.get_dummies(so_df)

which one-hot encodes the string columns (sex and first_name) and leaves the numeric columns (passenger_id, survived, age) unchanged.
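To encode only part of the DataFrame, something like the following might work. This is a sketch under assumptions, not the answerer's code: it uses the `columns` argument of `pd.get_dummies` to restrict encoding to `sex`, and then shows a scikit-learn variant built on `ColumnTransformer` with `remainder='passthrough'`. Treating `survived` as the target and dropping `first_name` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

so_df = pd.DataFrame({
    'passenger_id': [1, 2, 3, 4, 5],
    'survived': [1, 0, 0, 1, 0],
    'age': [24, 25, 68, 39, 5],
    'sex': ['female', 'male', 'male', 'female', 'female'],
    'first_name': ['Joanne', 'Mark', 'Josh', 'Petka', 'Ariel'],
})

# pandas: encode only 'sex'; every other column is left as it was.
dummies = pd.get_dummies(so_df, columns=['sex'])
print(dummies.columns.tolist())

# scikit-learn: one-hot encode 'sex' and pass the numeric columns through.
# Assumption: 'survived' is the target and 'first_name' is not a useful feature.
features = so_df.drop(columns=['survived', 'first_name'])
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['sex'])],
    remainder='passthrough',  # keep the columns that are not listed above
)
X = ct.fit_transform(features)
print(X.shape)
```

In the scikit-learn variant, the encoded columns come first in `X`, followed by the passed-through numeric columns, so the result can be handed to a decision tree classifier together with `so_df['survived']` as the labels.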



Source: https://stackoverflow.com/questions/56584288/one-hot-encoding-in-scikit-learn-for-only-part-of-the-dataframe
