Question
There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation states:
Some advantages of decision trees are:
(...)
Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.
But running the following script
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])
outputs the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
I know that in R it is possible to pass categorical data directly. Is that possible with Sklearn?
Answer 1:
Contrary to the accepted answer, I would prefer to use tools provided by Scikit-Learn for this purpose. The main reason for doing so is that they can be easily integrated in a Pipeline.
Scikit-Learn itself provides very good classes for handling categorical data. Instead of writing a custom function, you can use LabelEncoder, which is designed for this purpose.
Refer to the following code from the documentation:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])
This automatically encodes the strings as integers for your machine learning algorithms. It also supports converting the integers back to strings: simply call inverse_transform as follows:
list(le.inverse_transform([2, 2, 1]))
This would return ['tokyo', 'tokyo', 'paris'].
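As a rough sketch of how this could be applied to the data frame from the question (one LabelEncoder per string column is my own choice here, and the names encoders and encoded are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

# Keep one fitted encoder per string column so the codes can be
# mapped back to strings later with inverse_transform.
encoders = {}
encoded = data[['A', 'B', 'C']].copy()
for col in ['A', 'B']:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(data[col])

# The tree now receives only numeric columns, so fit() no longer raises.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(encoded, data['Class'])
predictions = tree.predict(encoded)
```

Keep in mind that the tree treats these integer codes as ordinary numbers, so the ordering of the codes influences the splits.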
Also note that for many other classifiers besides decision trees, such as logistic regression or SVM, you would want to encode your categorical variables using one-hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.
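For what it's worth, here is a minimal sketch of the Pipeline integration mentioned above, assuming a scikit-learn version that ships sklearn.compose.ColumnTransformer (0.20 or later). The step names 'onehot', 'preprocess' and 'tree' are arbitrary labels I picked, and handle_unknown='ignore' is an optional choice, not something required by the question:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})
features = data[['A', 'B', 'C']]

# One-hot encode the string columns and pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['A', 'B'])],
    remainder='passthrough',
)

# Encoding and model live in one estimator, so fit/predict take the raw frame.
model = Pipeline([
    ('preprocess', preprocess),
    ('tree', DecisionTreeClassifier(random_state=0)),
])
model.fit(features, data['Class'])
predictions = model.predict(features)
```

Because the encoder is fitted inside the pipeline, the same column mapping is reused automatically at prediction time.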
Hope this helps!
Answer 2:
(This is just a reformat of my comment above from 2016...it still holds true.)
The accepted answer for this question is misleading.
As it stands, sklearn decision trees do not handle categorical data - see issue #5442.
The recommended approach of using label encoding converts the categories to integers, which DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good: you'll end up with splits that do not make sense.
Using a OneHotEncoder is the only current valid way, allowing arbitrary splits not dependent on the label ordering, but it is computationally expensive.
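To make the ordering problem concrete, here is a small sketch on made-up data (the color column and its three categories are purely hypothetical). The positive class is the middle category, so a single threshold on the integer codes cannot isolate it, while a single split on a one-hot column can:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical nominal feature; the target is 1 only for category 'b'.
colors = pd.Series(['a', 'b', 'c'] * 10, name='color')
target = (colors == 'b').astype(int)

# Label encoding: a=0, b=1, c=2, which the tree reads as an ordered number.
label_X = LabelEncoder().fit_transform(colors).reshape(-1, 1)

# One-hot encoding: one indicator column per category.
onehot_X = pd.get_dummies(colors)

# Depth-1 trees (a single split) make the difference visible.
stump_label = DecisionTreeClassifier(max_depth=1, random_state=0).fit(label_X, target)
stump_onehot = DecisionTreeClassifier(max_depth=1, random_state=0).fit(onehot_X, target)

label_acc = stump_label.score(label_X, target)     # one threshold cannot cut out the middle code
onehot_acc = stump_onehot.score(onehot_X, target)  # a split on the 'b' indicator separates the classes
```

A deeper tree can still recover the pattern from the integer codes, but it needs extra splits that exist only because of the arbitrary ordering.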
Answer 3:
(..)
Able to handle both numerical and categorical data.
This only means that you can use
- the DecisionTreeClassifier class for classification problems
- the DecisionTreeRegressor class for regression.
In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
one_hot_data = pd.get_dummies(data[['A','B','C']],drop_first=True)
tree.fit(one_hot_data, data['Class'])
Answer 4:
Sklearn Decision Trees do not handle conversion of categorical strings to numbers. I suggest you find a function in Sklearn (maybe this) that does so or manually write some code like:
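def cat2int(column):
    # Map each distinct string to an integer code; sorting keeps the mapping reproducible.
    vals = sorted(set(column))
    # Note: this overwrites the entries of `column` in place.
    for i, string in enumerate(column):
        column[i] = vals.index(string)
    return column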
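If pandas is already in use, as it is in the question, pd.factorize does roughly the same job as the hand-written function. This is a substitute I am suggesting, not what the answer originally linked to. It assigns codes in order of first appearance and also returns the lookup needed to go back to the strings:

```python
import pandas as pd

column = ['a', 'a', 'b', 'a']

# codes holds one integer per entry; uniques holds the distinct values,
# positioned so that uniques[code] gives back the original string.
codes, uniques = pd.factorize(column)

# Map the integer codes back to the original strings.
restored = list(uniques[codes])
```

Unlike cat2int, this leaves the input untouched and returns the codes as a new array.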
Source: https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree