onehot encoding: preserve column structure

Submitted by 岁酱吖の on 2019-12-24 18:13:16

Question


I'm trying to solve a problem that has arisen while productionising an XGBoost model. The column order of the training data is not replicated identically in the production data I need to score. The issue arises in the one-hot encoding step: not all levels of each variable that were present in the training data are present in the production scoring data. As a result, scoring either produces inconsistent and incorrect results or fails completely.

To overcome this, I am trying to build a process into the one-hot encoding step that ensures the column structure stays consistent. My theory is that if I save the header object created from the training dataset, I can then call the onehot predict function with that header on each production scoring set.
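The idea of saving the training column structure and reusing it at scoring time can be sketched in base R, independently of any encoding package. This is only a minimal sketch: `align_columns`, the toy matrix, and the column names are hypothetical, and it assumes the encoded production data is a numeric matrix with column names.

```r
# Hypothetical helper: force an encoded matrix to have exactly the
# training columns, in the training order.
align_columns <- function(mat, train_cols) {
  # Columns seen in training but absent from this matrix are padded with zeros.
  missing <- setdiff(train_cols, colnames(mat))
  if (length(missing) > 0) {
    pad <- matrix(0, nrow = nrow(mat), ncol = length(missing),
                  dimnames = list(NULL, missing))
    mat <- cbind(mat, pad)
  }
  # Selecting by name drops unseen extra columns and fixes the order.
  mat[, train_cols, drop = FALSE]
}

# Save the training column names once (illustrative names), reload at scoring time.
path <- tempfile(fileext = ".rds")
saveRDS(c("a=x", "a=y", "b"), path)
train_cols <- readRDS(path)

# Toy production matrix in which the level "a=x" never occurred.
test_mat <- matrix(c(1, 1, 5, 6), nrow = 2,
                   dimnames = list(NULL, c("a=y", "b")))
aligned <- align_columns(test_mat, train_cols)
```

Padding with zeros assumes that an absent indicator column means "this level did not occur", which holds for one-hot indicators but not for numeric features.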

For example, suppose I have two datasets, test and train. I can one-hot encode the train data with the onehot package as:

library(onehot)

header <- onehot(train, max_levels = 100)

trainmatrix <- predict(header, train)

To preserve the column structure of this matrix, I want to reuse the header object created above to one-hot encode the test data:

testmatrix <- predict(header, test)

The issue is that the results do not line up as I had hoped.

If I have train data:

To create header vector:

Then use this to onehot encode test data:

I get matrix:

These results obviously do not meet my expectations for an effective solution. Does anyone have a different solution for this?
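One thing that may be worth checking is whether the factor columns in the test data carry the same levels as in the training data. The following is a minimal sketch under that assumption, not a confirmed diagnosis of the problem above: the toy data frames and column names are hypothetical, and base R's `model.matrix` is swapped in for the onehot package so the example is self-contained.

```r
# Hypothetical toy data; the test set only contains one of the three levels.
train <- data.frame(colour = factor(c("red", "green", "blue")),
                    size   = c(1, 2, 3))
test  <- data.frame(colour = factor(c("red", "red")),
                    size   = c(4, 5))

# Re-declare every factor in test with the levels recorded in train.
factor_cols <- names(train)[sapply(train, is.factor)]
for (col in factor_cols) {
  test[[col]] <- factor(as.character(test[[col]]),
                        levels = levels(train[[col]]))
}

# "- 1" removes the intercept, so the first factor gets one indicator
# column per level. Any further factors would be contrast coded instead.
train_mm <- model.matrix(~ . - 1, data = train)
test_mm  <- model.matrix(~ . - 1, data = test)
```

With the levels aligned, both matrices have the same columns in the same order. Values in test that were never seen in training become `NA` after the re-levelling step, and `model.matrix` drops such rows under the default `na.action`, so they would need separate handling.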

Source: https://stackoverflow.com/questions/51997540/onehot-encoding-preserve-column-structure
