Question
When I specify the feature names while defining the DMatrix (the internal data structure used by XGBoost), I get this error:
d_train = xgboost.DMatrix(X_train, label=y_train, feature_names=list(X))
d_test = xgboost.DMatrix(X_test, label=y_test, feature_names=list(X))
...
...
...
shap_values = shap.TreeExplainer(model).shap_values(X_train)
shap.summary_plot(shap_values, X_train)
ValueError Traceback (most recent call last)
<ipython-input-59-4635c450279d> in <module>()
----> 1 shap_values = shap.TreeExplainer(model).shap_values(X_train)
2 shap.summary_plot(shap_values, X_train)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\shap\explainers\tree.py in shap_values(self, X, **kwargs)
104 if not str(type(X)).endswith("xgboost.core.DMatrix'>"):
105 X = xgboost.DMatrix(X)
--> 106 phi = self.trees.predict(X, pred_contribs=True)
107 elif self.model_type == "lightgbm":
108 phi = self.trees.predict(X, pred_contrib=True)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs)
1042 option_mask |= 0x08
1043
-> 1044 self._validate_features(data)
1045
1046 length = c_bst_ulong()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\core.py in _validate_features(self, data)
1286
1287 raise ValueError(msg.format(self.feature_names,
-> 1288 data.feature_names))
1289
1290 def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):
ValueError: feature_names mismatch: ['Serial No', 'gender', 'Date', 'Product_Type', 'Product_Type', ... ... , 'Last_feature'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39']
<names of some features at column number corresponding to feature number in the following list> in input data
training data did not have the following fields: f7, f31, f33, f11, f6, f26, f2, f5, f17, f4, f37, f9, f1, f0, f39, f14, f12, f23, f13, f15, f22, f19, f35, f24, f38, f8, f28, f25, f20, f34, f27, f32, f36, f29, f16, f3, f21, f18, f30, f10
When I don't specify the feature names while defining the DMatrix, I get no errors and the summary plot is produced. But I need the actual feature names to appear in the plot instead of Feature 2, Feature 15, and so on. Why is this error occurring, and how do I fix it?
In case you want it, here's the full code. It's basically me trying to replicate the visualizations in this link, but for my dataset, with the model training parameters customized accordingly:
from sklearn.model_selection import train_test_split
import xgboost
import shap
import xlrd
import numpy as np
import matplotlib.pylab as pl
# print the JS visualization code to the notebook
shap.initjs()
import pandas as pd
data = pd.read_csv('InputCEM_FS_out.csv')
X = data.loc[:, data.columns != 'Score']
y = data['Score']
y = y/max(y)
# create a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
# Some of the values are floats or integers and some are objects. This is why we need to encode them:
from sklearn import preprocessing
for f in X_train.columns:
    if X_train[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(X_train[f].values))
        X_train[f] = lbl.transform(list(X_train[f].values))
for f in X_test.columns:
    if X_test[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(X_test[f].values))
        X_test[f] = lbl.transform(list(X_test[f].values))
X_train.fillna((-999), inplace=True)
X_test.fillna((-999), inplace=True)
X_train=np.array(X_train)
X_test=np.array(X_test)
X_train = X_train.astype(float)
X_test = X_test.astype(float)
d_train = xgboost.DMatrix(X_train, label=y_train, feature_names=list(X)) # This gives the error later on. Remove the "feature_names=list(X)" part to not get the error.
d_test = xgboost.DMatrix(X_test, label=y_test, feature_names=list(X)) # This gives the error later on. Remove the "feature_names=list(X)" part to not get the error.
params = [
    ('max_depth', 3),
    ('eta', 0.025),
    ('objective', 'binary:logistic'),
    ('min_child_weight', 4),
    ('silent', 1),
    ('eval_metric', 'auc'),
    ('subsample', 0.75),
    ('colsample_bytree', 0.75),
    ('gamma', 0.75),
]
model = xgboost.train(params, d_train, 5000, evals = [(d_test, "test")], verbose_eval=100, early_stopping_rounds=20)
shap_values = shap.TreeExplainer(model).shap_values(X_train) # This line is what gives the error if the feature names are specified
shap.summary_plot(shap_values, X_train)
Answer 1:
As the error message shows, the model was trained on d_train with the real feature names, while the data handed to predict inside SHAP ends up with the default column names f0, f1, and so on (hence the list f7, f31, … in the error). It seems the cause is here:
shap_values = shap.TreeExplainer(model).shap_values(X_train)
You pass X_train, which is just a NumPy array without column names (they become f0, f1, and so on when it is wrapped in a DMatrix). Instead, try passing a DataFrame with the desired columns:
shap_values = shap.TreeExplainer(model).shap_values(pd.DataFrame(X_train, columns=X.columns))
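If you also want the real feature names to show up on the summary plot itself, you can build that DataFrame once and pass it to both calls. A minimal sketch of that idea, assuming the code from the question has already run (the name X_train_df is introduced here just for illustration):
import pandas as pd

# Rebuild a DataFrame view of the NumPy array, reusing the original column names
# so they match the feature_names the model was trained with.
X_train_df = pd.DataFrame(X_train, columns=X.columns)

# With a DataFrame, the feature names line up with the model, and
# shap.summary_plot labels the features by name instead of "Feature 2", etc.
shap_values = shap.TreeExplainer(model).shap_values(X_train_df)
shap.summary_plot(shap_values, X_train_df)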
Source: https://stackoverflow.com/questions/50711382/why-am-i-getting-a-valueerror-feature-names-mismatch-when-specifying-the-feat