Question
I am applying OneHotEncoder to a NumPy array.
Here's the code:
print X.shape, test_data.shape #gives (4100, 15) (410, 15)
onehotencoder_1 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
X = onehotencoder_1.fit_transform(X).toarray()
onehotencoder_2 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
test_data = onehotencoder_2.fit_transform(test_data).toarray()
print X.shape, test_data.shape #gives (4100, 46) (410, 43)
Both X and test_data are <type 'numpy.ndarray'>.
X is my training set and test_data is my test set.
Why is the number of columns different for X and test_data? I expected both to have the same number of columns (either 46 or 43) after applying OneHotEncoder.
I apply OneHotEncoder only to specific attributes because they are categorical in both X and test_data.
Can someone point out what is wrong here?
Answer 1:
Don't fit a new OneHotEncoder on test_data. Reuse the first one and call only transform() on it:
test_data = onehotencoder_1.transform(test_data).toarray()
Never use fit() (or fit_transform()) on test data.
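On recent scikit-learn releases the categorical_features argument is no longer available (it was removed in version 0.22), so the same fit-on-train, transform-on-test pattern is usually written with ColumnTransformer. The following is a minimal sketch, not the asker's actual setup: the toy arrays, the column index [0], and the names X_train, X_test and encoder are made up for illustration, and handle_unknown="ignore" is an optional extra that encodes categories unseen during fit as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): column 0 is categorical, column 1 is numeric.
X_train = np.array([[0, 1.5], [1, 2.0], [2, 0.5], [1, 3.0]])
X_test = np.array([[0, 1.0], [1, 2.5]])  # category 2 never appears here

# One-hot encode column 0 only and pass the numeric column through.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

X_train_enc = encoder.fit_transform(X_train)  # learn categories from train only
X_test_enc = encoder.transform(X_test)        # reuse those categories on test

print(X_train_enc.shape, X_test_enc.shape)  # (4, 4) (2, 4)
```

Both outputs have four columns (three one-hot columns plus the numeric one), even though the test set contains only two of the three categories.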
A different number of columns is entirely possible, because the test data may not contain some categories that are present in the training data. When you create a new OneHotEncoder and call fit() (or fit_transform()) on it, it learns only the categories present in test_data, so the resulting columns differ from those of the training set.
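The mechanism can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation; the helper names fit_categories and one_hot and the colour values are hypothetical.

```python
def fit_categories(column):
    """Return the sorted distinct values seen in a column."""
    return sorted(set(column))


def one_hot(column, categories):
    """Encode each value as a 0/1 row with one slot per known category."""
    return [[1 if value == cat else 0 for cat in categories] for value in column]


train_col = ["red", "green", "blue", "green"]
test_col = ["red", "green"]  # "blue" is missing from the test data

train_cats = fit_categories(train_col)  # ['blue', 'green', 'red']
test_cats = fit_categories(test_col)    # ['green', 'red']

refit = one_hot(test_col, test_cats)    # categories learned from test: 2 columns
reused = one_hot(test_col, train_cats)  # categories learned from train: 3 columns

print(len(refit[0]), len(reused[0]))  # 2 3
```

Refitting on the test column yields two columns, while reusing the categories learned from the training column yields three, which matches the training layout.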
Source: https://stackoverflow.com/questions/50460930/applying-onehotencoder-on-numpy-array