Question
I'm trying to solve a problem that has arisen while productionising an XGBoost model. The column order of the training data is not replicated identically in the production data I need to score. The issue stems from the one-hot encoding step: not all levels of each variable that were present in the training data are present in the production scoring data. As a result, scoring either produces inconsistent and incorrect results or fails completely.
To overcome this, I am trying to build a process into the one-hot encoding step that ensures the column structure stays consistent. My theory is that if I save the header vector created from the training dataset, I could then call the onehot predict function with that header on each production scoring set.
For example, suppose I have two datasets, train and test. I can one-hot encode the train data with the onehot package as:
library(onehot)
# Build the encoder from the training data, then apply it to the same data
header <- onehot(train, max_levels = 100)
trainmatrix <- predict(header, train)
To preserve the column structure of this matrix, I want to reuse the header object created above to one-hot encode the test data:
testmatrix <- predict(header, test)
The issue is that the results do not line up as I had hoped.
If I have train data:
To create header vector:
Then use this to onehot encode test data:
I get matrix:
These results obviously do not meet my expectations for an effective solution. Does anyone have a different solution for this?
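To make the outcome I am after concrete, here is a minimal base R sketch of the column alignment, written without the onehot package. The toy data, the `colour` and `x` column names, and the `align_columns` helper are all hypothetical and only illustrate the idea. The encoding uses `model.matrix`, which with `- 1` gives one indicator column per level for a single factor (with several factors the later ones would be contrast-coded, so this is not a general encoder).

```r
# Hypothetical toy data: the level "c" of `colour` never occurs in `test`.
train <- data.frame(colour = c("a", "b", "c"), x = c(1, 2, 3),
                    stringsAsFactors = TRUE)
test  <- data.frame(colour = c("b", "a"), x = c(4, 5),
                    stringsAsFactors = TRUE)

# Base R one-hot encoding; "- 1" removes the intercept so the single
# factor here gets one indicator column per level.
train_matrix <- model.matrix(~ . - 1, data = train)
raw_test     <- model.matrix(~ . - 1, data = test)

# Save the training column layout once, at training time.
train_cols <- colnames(train_matrix)

# Hypothetical helper: force any encoded matrix into the saved layout.
# Columns missing from `mat` are filled with 0; columns that were not
# seen in training are dropped; the order follows `ref_cols`.
align_columns <- function(mat, ref_cols) {
  out <- matrix(0, nrow = nrow(mat), ncol = length(ref_cols),
                dimnames = list(NULL, ref_cols))
  shared <- intersect(colnames(mat), ref_cols)
  if (length(shared) > 0) {
    out[, shared] <- mat[, shared, drop = FALSE]
  }
  out
}

test_matrix <- align_columns(raw_test, train_cols)
```

With this, `colnames(test_matrix)` matches `colnames(train_matrix)`, and the column for the level missing from test is all zeros. Whether silently dropping unseen levels is acceptable depends on the model, so that part is an assumption of the sketch.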
Source: https://stackoverflow.com/questions/51997540/onehot-encoding-preserve-column-structure