Map predictions back to IDs - Python Scikit Learn DecisionTreeClassifier

Question

I have a dataset that has a unique identifier and other features. It looks like this:

ID      LenA TypeA LenB TypeB Diff Score Response
123-456  51   M     101  L     50   0.2   0
234-567  46   S     49   S     3    0.9   1
345-678  87   M     70   M     17   0.7   0

I split it into training and test data, and I am trying to classify the test data into two classes using a classifier trained on the training data. I want to keep the identifier in both the training and test datasets so I can map the predictions back to the IDs.
Is there a way to mark the identifier column as an ID or non-predictor, like we can do in Azure ML Studio or SAS?

I am using the DecisionTreeClassifier from Scikit-Learn. This is the code I have for the classifier.

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(traindata, trainlabels)

If I just include the ID in traindata, the code throws an error:

ValueError: invalid literal for float(): 123-456

Answer 1:

Not knowing how you made your split, I would suggest just making sure the ID column is not included in your training data. Something like this, perhaps:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['ID', 'Response'])].values, df.Response)

That will split only the values from the DataFrame not in ID or Response for the X values, and split Response for the y values.

However, you still won't be able to fit a DecisionTreeClassifier on this data, because it contains strings. You will need to convert every categorical column, i.e. TypeA and TypeB, to a numerical representation. The best way to do this for sklearn, in my opinion, is with the LabelEncoder. It converts categorical string labels such as ['M', 'S'] into the integer codes [0, 1] (codes start at 0 and follow the sorted order of the labels), which the DecisionTreeClassifier can work with. If you need an example, take a look at Passing categorical data to sklearn decision tree.
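As a rough illustration of the encoding step described above, here is a minimal sketch. It assumes the data is in a pandas DataFrame named df whose columns match the sample in the question; the encoders dictionary is a hypothetical helper name, not something from the original post.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame built from the three sample rows shown in the question.
df = pd.DataFrame({
    "ID": ["123-456", "234-567", "345-678"],
    "LenA": [51, 46, 87],
    "TypeA": ["M", "S", "M"],
    "LenB": [101, 49, 70],
    "TypeB": ["L", "S", "M"],
    "Diff": [50, 3, 17],
    "Score": [0.2, 0.9, 0.7],
    "Response": [0, 1, 0],
})

# Fit one encoder per categorical column and keep it around, so the
# integer codes can be translated back to the original labels later.
encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in ["TypeA", "TypeB"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df[["TypeA", "TypeB"]])
```

Each encoder assigns codes independently per column, so the same letter may get a different code in TypeA than in TypeB. Calling encoders[col].inverse_transform(...) recovers the original strings.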

Update

Per your comment I now understand that you need to map back to the ID. In this case you can leverage pandas to your advantage. Set ID as the index of your data and then do the split, that way you will retain the ID value for all of your train and test data. Let's assume your data are already in a pandas dataframe.

df = df.set_index('ID')
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['Response'])], df.Response)
print(X_train)
         LenA TypeA  LenB TypeB  Diff  Score
ID
345-678    87     M    70     M    17    0.7
234-567    46     S    49     S     3    0.9
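To show how the index then carries the IDs through to the predictions, here is a small sketch. The first three rows come from the question; the other three rows are made up purely so the split has something to work with, and the TypeA/TypeB columns are left out for brevity (they would need encoding first). The names X, y and predictions are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rows 1-3 are from the question; rows 4-6 are hypothetical filler data.
df = pd.DataFrame({
    "ID": ["123-456", "234-567", "345-678", "456-789", "567-890", "678-901"],
    "LenA": [51, 46, 87, 60, 45, 90],
    "LenB": [101, 49, 70, 110, 47, 72],
    "Diff": [50, 3, 17, 50, 2, 18],
    "Score": [0.2, 0.9, 0.7, 0.3, 0.8, 0.6],
    "Response": [0, 1, 0, 0, 1, 0],
}).set_index("ID")

X = df.drop(columns=["Response"])
y = df["Response"]

# Splitting DataFrames (not .values) keeps the ID index on every piece.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# predict() returns a plain array in the same row order as X_test, so
# attaching X_test.index restores the ID for each prediction.
predictions = pd.Series(clf.predict(X_test), index=X_test.index, name="Predicted")
print(predictions)
```

Because the ID lives in the index rather than in a column, it never reaches the classifier as a feature, yet every prediction can still be looked up by ID, for example with predictions.loc[some_id].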

Source: https://stackoverflow.com/questions/43549034/map-predictions-back-to-ids-python-scikit-learn-decisiontreeclassifier

Tags

python

scikit-learn

classification

decision-tree

valueerror