Question
I have a dataset that has a unique identifier and other features. It looks like this:
ID       LenA  TypeA  LenB  TypeB  Diff  Score  Response
123-456  51    M      101   L      50    0.2    0
234-567  46    S      49    S      3     0.9    1
345-678  87    M      70    M      17    0.7    0
I split it into training and test data. I am trying to classify the test data into two classes using a classifier trained on the training data. I want to keep the identifier in both the training and test datasets so I can map the predictions back to the IDs.
Is there a way to mark the identifier column as an ID or non-predictor, as we can in Azure ML Studio or SAS?
I am using the DecisionTreeClassifier from Scikit-Learn. This is the code I have for the classifier.
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(traindata, trainlabels)
If I just include the ID in traindata, the code throws an error:
ValueError: invalid literal for float(): 123-456
Answer 1:
Not knowing how you made your split, I would suggest just making sure the ID column is not included in your training data. Something like this perhaps:
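from sklearn.model_selection import train_test_split
# Keep every column except ID and Response as features; Response is the target.
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['ID', 'Response'])].values, df.Response)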
That will split only the values from the DataFrame columns other than ID or Response for the X values, and split Response for the y values.
But you will still not be able to use the DecisionTreeClassifier with this data, as it contains strings. You will need to convert any column with categorical data, i.e. TypeA and TypeB, to a numerical representation. The best way to do this for sklearn, in my opinion, is with the LabelEncoder. Using this will convert the categorical string labels ['M', 'S'] into [0, 1], which can be used with the DecisionTreeClassifier. If you need an example, take a look at Passing categorical data to sklearn decision tree.
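As a rough illustration of the encoding step described above, the sketch below applies one LabelEncoder per categorical column and then fits the tree. The DataFrame is rebuilt by hand from the three sample rows in the question, and the `encoders` dictionary and `random_state=0` are assumptions made for this example rather than part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# The three sample rows from the question, typed in by hand.
df = pd.DataFrame({
    "ID": ["123-456", "234-567", "345-678"],
    "LenA": [51, 46, 87],
    "TypeA": ["M", "S", "M"],
    "LenB": [101, 49, 70],
    "TypeB": ["L", "S", "M"],
    "Diff": [50, 3, 17],
    "Score": [0.2, 0.9, 0.7],
    "Response": [0, 1, 0],
})

# One encoder per column, kept in a dict (hypothetical name) so the
# same mapping can be reused on new data or inverted later.
encoders = {}
for col in ["TypeA", "TypeB"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# ID and Response are left out of the feature matrix.
X = df.drop(columns=["ID", "Response"])
y = df["Response"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
```

LabelEncoder assigns integer codes in sorted order of the labels it sees, so the codes imply an ordering the categories may not really have. scikit-learn documents LabelEncoder as intended for target labels and provides OrdinalEncoder and OneHotEncoder for feature columns, so those may be worth considering as alternatives.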
Update
Per your comment I now understand that you need to map back to the ID. In this case you can leverage pandas to your advantage. Set ID as the index of your data and then do the split; that way you will retain the ID value for all of your train and test data. Let's assume your data are already in a pandas dataframe.
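from sklearn.model_selection import train_test_split
# With ID as the index, each row keeps its identifier through the split.
df = df.set_index('ID')
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['Response'])], df.Response)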
print(X_train)
         LenA TypeA  LenB TypeB  Diff  Score
ID
345-678    87     M    70     M    17    0.7
234-567    46     S    49     S     3    0.9
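To make the final step concrete, here is a minimal sketch of how the predictions might be attached back to the IDs once ID is the index. It assumes TypeA and TypeB have already been converted to integers; the first three rows follow the question's sample, while the last three rows, the integer codes, the `test_size` and `random_state` values, and the `predictions` name are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rows 1-3 mirror the question's sample; rows 4-6 are made up so
# there is enough data to split. TypeA/TypeB are assumed to be
# already encoded as integers.
df = pd.DataFrame({
    "ID": ["123-456", "234-567", "345-678", "456-789", "567-890", "678-901"],
    "LenA": [51, 46, 87, 60, 45, 90],
    "TypeA": [1, 2, 1, 1, 2, 1],
    "LenB": [101, 49, 70, 95, 47, 72],
    "TypeB": [0, 2, 1, 0, 2, 1],
    "Diff": [50, 3, 17, 35, 2, 18],
    "Score": [0.2, 0.9, 0.7, 0.3, 0.8, 0.6],
    "Response": [0, 1, 0, 0, 1, 0],
}).set_index("ID")

X = df.drop(columns="Response")
y = df["Response"]

# Splitting DataFrames (not .values) keeps the ID index on each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Reuse the test index so every prediction is labelled with its ID.
predictions = pd.Series(clf.predict(X_test), index=X_test.index, name="Prediction")
print(predictions)
```

Because the index is not a column, the ID never reaches the classifier as a feature, yet `predictions.loc[some_id]` can still look up the predicted class for a given identifier in the test set.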
Source: https://stackoverflow.com/questions/43549034/map-predictions-back-to-ids-python-scikit-learn-decisiontreeclassifier