Applying OneHotEncoder on a NumPy array

Posted by 会有一股神秘感。 on 2019-12-13 23:00:17

Question


I am applying OneHotEncoder on numpy array.

Here's the code

print X.shape, test_data.shape #gives (4100, 15) (410, 15)
onehotencoder_1 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
X = onehotencoder_1.fit_transform(X).toarray()
onehotencoder_2 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
test_data = onehotencoder_2.fit_transform(test_data).toarray()

print X.shape, test_data.shape #gives (4100, 46) (410, 43)

where both X and test_data are <type 'numpy.ndarray'>

X is my training set and test_data is my test set.

Why is the number of columns different for X and test_data? After applying OneHotEncoder, both should have the same number of columns, either 46 or 43.

I am applying OneHotEncoder only to specific attributes, because those are the categorical ones in both X and test_data.

Can someone point out what is wrong here?


Answer 1:


Don't create a new OneHotEncoder for test_data. Reuse the first one, and call only transform() on it:

test_data = onehotencoder_1.transform(test_data).toarray()

Never use fit() (or fit_transform()) on testing data.
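
The categorical_features argument used in the question was removed in later scikit-learn releases, so the same fit-on-train, transform-on-test pattern may need a different spelling there. Below is a minimal sketch in Python 3, assuming a scikit-learn version that provides ColumnTransformer. The toy arrays and the names X_train, X_test and categorical_columns are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column 0 is categorical (integer codes), column 1 is numeric.
X_train = np.array([[0, 1.5], [1, 2.5], [2, 3.5], [1, 4.5]])
X_test = np.array([[0, 9.0], [1, 8.0]])  # category 2 never appears in the test set

# Stands in for the question's list [0, 3, 4, 5, 6, 8, 9, 11, 12].
categorical_columns = [0]

encoder = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), categorical_columns)],
    remainder="passthrough",  # keep the non-categorical columns unchanged
    sparse_threshold=0,       # always return a dense array
)

# Learn the categories from the training set only.
X_train_encoded = encoder.fit_transform(X_train)
# Reuse the fitted encoder on the test set; nothing is fitted here.
X_test_encoded = encoder.transform(X_test)

print(X_train_encoded.shape, X_test_encoded.shape)  # (4, 4) (2, 4)
```

Both arrays end up with the same four columns (three one-hot columns followed by the untouched numeric column), even though the test set never contains category 2.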

A different number of columns is entirely possible, because the test data may not contain some categories that are present in the training data. When you create a new OneHotEncoder and call fit() (or fit_transform()) on it, it learns only the categories present in test_data, so the two encoded arrays end up with different columns.
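
The mismatch can be reproduced on a tiny example. This is a sketch in Python 3 with made-up data (a single categorical column where the test set lacks the category "c"); the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column; the test set lacks category "c".
train = np.array([["a"], ["b"], ["c"], ["a"]])
test = np.array([["a"], ["b"]])

# A second encoder fitted on the test set only ever sees "a" and "b".
wrong = OneHotEncoder().fit(test)
wrong_test_shape = wrong.transform(test).toarray().shape
print(wrong_test_shape)  # (2, 2)

# The encoder fitted on the training set keeps a column for "c".
right = OneHotEncoder().fit(train)
train_shape = right.transform(train).toarray().shape
right_test_shape = right.transform(test).toarray().shape
print(train_shape, right_test_shape)  # (4, 3) (2, 3)
```

The reverse situation, a test set containing a category that was absent from the training set, makes transform() raise an error by default; OneHotEncoder accepts handle_unknown="ignore" to encode such rows as all zeros instead.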



Source: https://stackoverflow.com/questions/50460930/applying-onehotencoder-on-numpy-array
