How to handle One-Hot Encoding in a production environment when the number of features in Training and Test is different?

Question

When running experiments, we usually train on 70% of the data and test on the remaining 30%. But what happens when your model is in production? The following may occur:

Training Set:

-----------------------
| Ser | Type Of Car   |
-----------------------
|  1  | Hatchback     |
|  2  | Sedan         |
|  3  | Coupe         |
|  4  | SUV           |
-----------------------

After one-hot encoding, this is what we get:

-----------------------------------------
| Ser | Hatchback | Sedan | Coupe | SUV |
-----------------------------------------
|  1  |     1     |   0   |   0   |  0  |
|  2  |     0     |   1   |   0   |  0  |
|  3  |     0     |   0   |   1   |  0  |
|  4  |     0     |   0   |   0   |  1  |
-----------------------------------------

My model is trained, and now I want to deploy it across multiple dealerships. The model was trained on 4 features. Now, a certain dealership sells only Sedans and Coupes:

Test Set:

-----------------------
| Ser | Type Of Car   |
-----------------------
|  1  | Coupe         |
|  2  | Sedan         |
-----------------------

One-hot encoding results in:

---------------------------
| Ser | Coupe     | Sedan |
---------------------------
|  1  |     1     |   0   |
|  2  |     0     |   1   |
---------------------------

Here our test set has only 2 features. It does not make sense to build a model for every new dealership. How should such problems be handled in production? Is there another encoding method that can be used to handle categorical variables?

Answer 1:

I'll assume you are using pandas to do the one-hot encoding. If not, you have to do some more work, but the logic is still the same.

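import pandas as pd

# Categories seen in the training set; this list fixes the output columns
known_categories = ['Sedan', 'Coupe', 'Limo']

# Production data containing a new category, 'Ferrari'
car_type = pd.Series(['Sedan', 'Ferrari'])

# Values outside known_categories become NaN in the Categorical
car_type = pd.Categorical(car_type, categories=known_categories)

# dtype=float matches the output shown below
pd.get_dummies(car_type, dtype=float)

The result is: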

    Sedan   Coupe   Limo
0   1.0      0.0    0.0    # Sedan entry
1   0.0      0.0    0.0    # Ferrari entry

Since Ferrari is not in the list of known categories, all the one-hot encoded entries for the Ferrari row are zero. If you find a new car type in your production data, every column encoding the car type should be 0 for that row.
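For readers who are not using pandas, the same logic can be sketched in plain Python. This is only an illustration: the helper name `one_hot` is hypothetical (it is not part of any library), and `known_categories` is assumed to hold the four car types from the question's training set.

```python
# Hypothetical helper for illustration; not part of pandas or any library.
def one_hot(values, known_categories):
    """Encode each value as a list of 0/1 flags, one flag per known category.

    A value that was not seen during training produces an all-zero row.
    """
    return [[1 if value == category else 0 for category in known_categories]
            for value in values]


# Categories fixed at training time (the four car types from the question).
known_categories = ['Hatchback', 'Sedan', 'Coupe', 'SUV']

# Production data: two known car types plus one never seen in training.
encoded = one_hot(['Coupe', 'Sedan', 'Ferrari'], known_categories)
for row in encoded:
    print(row)
```

Every row has exactly as many entries as there are known categories, regardless of which car types a given dealership sells, so the model always receives the number of features it was trained on.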

Answer 2:

The input to your model in production should have the same features as during training. So if you one-hot encoded 4 categories during training, produce the same 4 columns in production. Use zeros for categories that are missing from the production data, and drop categories you did not see during training.
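One possible way to apply this advice with pandas is to save the column names produced at training time and align the production encoding to them with `reindex`. The snippet below is a minimal sketch under assumptions: the column name `car_type` and the variable names (`training_columns`, `prod_aligned`) are made up for illustration, and the data is the dealership example from the question plus one unseen type.

```python
import pandas as pd

# Training data from the question; the set of encoded columns is fixed here.
train = pd.DataFrame({'car_type': ['Hatchback', 'Sedan', 'Coupe', 'SUV']})
train_encoded = pd.get_dummies(train['car_type'], dtype=int)
training_columns = list(train_encoded.columns)  # store this with the model

# Production data: two known car types and one never seen in training.
prod = pd.DataFrame({'car_type': ['Coupe', 'Sedan', 'Ferrari']})
prod_encoded = pd.get_dummies(prod['car_type'], dtype=int)

# reindex adds the missing training columns filled with zeros and
# drops any column (here 'Ferrari') that was not seen during training.
prod_aligned = prod_encoded.reindex(columns=training_columns, fill_value=0)
print(prod_aligned)
```

After alignment, the production frame has the same columns in the same order as the training frame, and the row for the unseen car type contains only zeros.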

Source: https://stackoverflow.com/questions/51505295/how-to-handle-one-hot-encoding-in-production-environment-when-number-of-features

Tags

python

machine-learning

feature-selection

one-hot-encoding