Can sklearn DecisionTreeClassifier truly work with categorical data?

Submitted by 笑着哭i on 2019-12-01 17:41:47

Well, I am surprised, but it turns out that sklearn's decision tree indeed cannot handle categorical data. There is a GitHub issue on this (#4899), open since June 2015 (I suggest you have a quick look at the thread, as some of the comments are very interesting).
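To see this concretely, here is a minimal sketch (the color data is hypothetical) showing that `DecisionTreeClassifier.fit` rejects raw string categories:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: raw string categories passed directly to the tree
X = [['red'], ['green'], ['blue']]
y = [0, 1, 0]

clf = DecisionTreeClassifier()
try:
    clf.fit(X, y)          # sklearn tries to coerce X to a float array
    failed = False
except ValueError as exc:  # fails: strings cannot be converted to float
    failed = True
    print("fit raised ValueError:", exc)
```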

The problem with encoding categorical variables as integers, as you have done here, is that it imposes an order on them, which may or may not be meaningful, depending on the case; for example, you could encode ['low', 'medium', 'high'] as [0, 1, 2], since 'low' < 'medium' < 'high' (we call such categorical variables ordinal), although you are still implicitly making the additional (and possibly undesired) assumption that the distance between 'low' and 'medium' is the same as the distance between 'medium' and 'high' (of no impact in decision trees, but of importance e.g. in k-NN and clustering). But this approach fails completely in cases like, say, ['red', 'green', 'blue'] or ['male', 'female'], since we cannot claim any meaningful relative order between them.
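A short sketch of both situations, using sklearn's `OrdinalEncoder` (the toy data is hypothetical): for a genuinely ordinal variable we can pass the order explicitly, while for colors the default encoding is simply alphabetical, i.e. arbitrary:

```python
from sklearn.preprocessing import OrdinalEncoder

# Ordinal case: supply the meaningful order explicitly via `categories`
ord_enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
levels = [['low'], ['high'], ['medium']]
levels_enc = ord_enc.fit_transform(levels)
print(levels_enc)  # low -> 0, medium -> 1, high -> 2, as intended

# Non-ordinal case: the default order is alphabetical, hence arbitrary
color_enc = OrdinalEncoder()
colors = [['red'], ['green'], ['blue']]
colors_enc = color_enc.fit_transform(colors)
print(colors_enc)  # blue -> 0, green -> 1, red -> 2: no real meaning
```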

So, for non-ordinal categorical variables, the proper way to encode them for use with sklearn's decision tree is one-hot encoding, via the OneHotEncoder class. The "Encoding categorical features" section of the user guide might also be helpful.
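Putting it together, a minimal sketch (again with hypothetical toy data) of one-hot encoding the categories and then fitting the tree:

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature, binary target
X = [['red'], ['green'], ['blue'], ['red']]
y = [1, 0, 0, 1]

# One-hot encode: each color becomes its own 0/1 column, so no
# artificial order is imposed between the categories
enc = OneHotEncoder(handle_unknown='ignore')
X_enc = enc.fit_transform(X).toarray()  # .toarray() works across sklearn versions

clf = DecisionTreeClassifier(random_state=0).fit(X_enc, y)

# New samples must go through the same fitted encoder before predict
pred = clf.predict(enc.transform([['red']]).toarray())
print(pred)
```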
