Can sklearn DecisionTreeClassifier truly work with categorical data?

Submitted by 笑着哭i on 2019-12-01 17:41:47

Well, I am surprised, but it turns out that sklearn's decision tree indeed cannot handle categorical data. There is a GitHub issue on this (#4899), open since June 2015 (I suggest you have a quick look at the thread, as some of the comments are very interesting).
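To see this concretely, here is a minimal sketch (the color data is hypothetical) showing that `DecisionTreeClassifier.fit` rejects raw string categories:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: raw string categories passed directly to the tree
X = [['red'], ['green'], ['blue']]
y = [0, 1, 0]

clf = DecisionTreeClassifier()
try:
    clf.fit(X, y)          # sklearn tries to coerce X to a float array
    failed = False
except ValueError as exc:  # fails: strings cannot be converted to float
    failed = True
    print("fit raised ValueError:", exc)
```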

The problem with encoding categorical variables as integers, as you have done here, is that it imposes an order on them, which may or may not be meaningful, depending on the case; for example, you could encode ['low', 'medium', 'high'] as [0, 1, 2], since 'low' < 'medium' < 'high' (we call such categorical variables ordinal), although you are still implicitly making the additional (and possibly undesired) assumption that the distance between 'low' and 'medium' is the same as the distance between 'medium' and 'high' (of no impact in decision trees, but of importance e.g. in k-NN and clustering). But this approach fails completely in cases like, say, ['red', 'green', 'blue'] or ['male', 'female'], since we cannot claim any meaningful relative order between them.
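A short sketch of both situations, using sklearn's `OrdinalEncoder` (the toy data is hypothetical): for a genuinely ordinal variable we can pass the order explicitly, while for colors the default encoding is simply alphabetical, i.e. arbitrary:

```python
from sklearn.preprocessing import OrdinalEncoder

# Ordinal case: supply the meaningful order explicitly via `categories`
ord_enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
levels = [['low'], ['high'], ['medium']]
levels_enc = ord_enc.fit_transform(levels)
print(levels_enc)  # low -> 0, medium -> 1, high -> 2, as intended

# Non-ordinal case: the default order is alphabetical, hence arbitrary
color_enc = OrdinalEncoder()
colors = [['red'], ['green'], ['blue']]
colors_enc = color_enc.fit_transform(colors)
print(colors_enc)  # blue -> 0, green -> 1, red -> 2: no real meaning
```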

So, for non-ordinal categorical variables, the proper way to encode them for use with sklearn's decision tree is one-hot encoding, via the OneHotEncoder class. The "Encoding categorical features" section of the user guide might also be helpful.
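Putting it together, a minimal sketch (again with hypothetical toy data) of one-hot encoding the categories and then fitting the tree:

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature, binary target
X = [['red'], ['green'], ['blue'], ['red']]
y = [1, 0, 0, 1]

# One-hot encode: each color becomes its own 0/1 column, so no
# artificial order is imposed between the categories
enc = OneHotEncoder(handle_unknown='ignore')
X_enc = enc.fit_transform(X).toarray()  # .toarray() works across sklearn versions

clf = DecisionTreeClassifier(random_state=0).fit(X_enc, y)

# New samples must go through the same fitted encoder before predict
pred = clf.predict(enc.transform([['red']]).toarray())
print(pred)
```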
