Why am I getting a “ValueError: feature_names mismatch” when specifying the feature-name list in XGBoost for visualization?

Question

When I pass the feature names while constructing the DMatrix (the internal data structure used by XGBoost), I get this error:

d_train = xgboost.DMatrix(X_train, label=y_train, feature_names=list(X))
d_test = xgboost.DMatrix(X_test, label=y_test, feature_names=list(X))
...
...
...
shap_values = shap.TreeExplainer(model).shap_values(X_train)
shap.summary_plot(shap_values, X_train)

ValueError                                Traceback (most recent call last)
<ipython-input-59-4635c450279d> in <module>()
----> 1 shap_values = shap.TreeExplainer(model).shap_values(X_train)
      2 shap.summary_plot(shap_values, X_train)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\shap\explainers\tree.py in shap_values(self, X, **kwargs)
    104             if not str(type(X)).endswith("xgboost.core.DMatrix'>"):
    105                 X = xgboost.DMatrix(X)
--> 106             phi = self.trees.predict(X, pred_contribs=True)
    107         elif self.model_type == "lightgbm":
    108             phi = self.trees.predict(X, pred_contrib=True)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs)
   1042             option_mask |= 0x08
   1043 
-> 1044         self._validate_features(data)
   1045 
   1046         length = c_bst_ulong()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\core.py in _validate_features(self, data)
   1286 
   1287                 raise ValueError(msg.format(self.feature_names,
-> 1288                                             data.feature_names))
   1289 
   1290     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['Serial No', 'gender', 'Date', 'Product_Type', 'Product_Type', ... ... , 'Last_feature'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39']
<names of some features at column number corresponding to feature number in the following list> in input data
training data did not have the following fields: f7, f31, f33, f11, f6, f26, f2, f5, f17, f4, f37, f9, f1, f0, f39, f14, f12, f23, f13, f15, f22, f19, f35, f24, f38, f8, f28, f25, f20, f34, f27, f32, f36, f29, f16, f3, f21, f18, f30, f10

When I don't specify the feature names while defining the DMatrix, I get no errors and the summary plot is produced.

But I need the names of the features to appear in the plot instead of Feature 2, Feature 15, etc. Why is this error occurring and how do I fix it?

In case it helps, here is the full code. I am trying to replicate the visualizations in this link, using my own dataset and correspondingly adjusted model training parameters:

from sklearn.model_selection import train_test_split
import xgboost
import shap
import xlrd
import numpy as np
import matplotlib.pylab as pl

# print the JS visualization code to the notebook
shap.initjs()

import pandas as pd
data = pd.read_csv('InputCEM_FS_out.csv')
X = data.loc[:, data.columns != 'Score'] 
y = data['Score']
y = y/max(y)

# create a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)


# Some columns are float or integer and some are object dtype, so the object columns need to be label-encoded:
from sklearn import preprocessing 
for f in X_train.columns: 
    if X_train[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder() 
        lbl.fit(list(X_train[f].values)) 
        X_train[f] = lbl.transform(list(X_train[f].values))

for f in X_test.columns: 
    if X_test[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder() 
        lbl.fit(list(X_test[f].values)) 
        X_test[f] = lbl.transform(list(X_test[f].values))

X_train.fillna((-999), inplace=True) 
X_test.fillna((-999), inplace=True)

X_train=np.array(X_train) 
X_test=np.array(X_test) 
X_train = X_train.astype(float) 
X_test = X_test.astype(float)

d_train = xgboost.DMatrix(X_train, label=y_train, feature_names=list(X)) # This gives the error later on. Remove the "feature_names=list(X)" part to not get the error.
d_test = xgboost.DMatrix(X_test, label=y_test, feature_names=list(X)) # This gives the error later on. Remove the "feature_names=list(X)" part to not get the error.

params = [
    ('max_depth', 3),
    ('eta', 0.025),
    ('objective', 'binary:logistic'),
    ('min_child_weight', 4),
    ('silent', 1),
    ('eval_metric', 'auc'),
    ('subsample', 0.75),
    ('colsample_bytree', 0.75),
    ('gamma', 0.75),
]

model = xgboost.train(params, d_train, 5000, evals = [(d_test, "test")], verbose_eval=100, early_stopping_rounds=20)

shap_values = shap.TreeExplainer(model).shap_values(X_train) # This line is what gives the error if the feature names are specified
shap.summary_plot(shap_values, X_train)

Answer 1:

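The traceback shows that the model expects the real column names (it was trained on d_train, which was built with feature_names=list(X)), while the data passed for prediction carries only the default names f0, f1, and so on. The cause appears to be this line: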

shap_values = shap.TreeExplainer(model).shap_values(X_train)

Here X_train is a plain NumPy array with no column names, so SHAP wraps it in a new DMatrix whose features get the default names f0, f1, and so on. Instead, pass a DataFrame with the desired columns:

shap_values = shap.TreeExplainer(model).shap_values(pd.DataFrame(X_train, columns=X.columns))
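
To make the mismatch concrete, here is a minimal pure-Python sketch of the kind of name check the traceback points to. It is not XGBoost's actual code: default_feature_names and check_feature_names are hypothetical helpers written only for illustration, and the three column names are a subset taken from the error message above.

```python
def default_feature_names(n_columns):
    """Names assumed for a matrix that carries no column names: f0, f1, ..."""
    return [f"f{i}" for i in range(n_columns)]


def check_feature_names(trained_names, data_names):
    """Hypothetical stand-in for the model's name check; raises on any mismatch."""
    if list(trained_names) != list(data_names):
        unexpected = sorted(set(data_names) - set(trained_names))
        raise ValueError(
            f"feature_names mismatch: {list(trained_names)} {list(data_names)}\n"
            "training data did not have the following fields: "
            + ", ".join(unexpected)
        )


# Names the model was trained with (illustrative subset of the real columns).
trained_names = ["Serial No", "gender", "Date"]

# Case 1: a bare array has no names, so it is given the defaults and the check fails.
bare_names = default_feature_names(len(trained_names))
try:
    check_feature_names(trained_names, bare_names)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    message = str(err)

# Case 2: input that carries the same names as the training data passes the check.
check_feature_names(trained_names, trained_names)
```

In the real code, wrapping X_train in a DataFrame with columns=X.columns plays the role of the second case: the names travel with the data, so they match what the model saw during training.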

Source: https://stackoverflow.com/questions/50711382/why-am-i-getting-a-valueerror-feature-names-mismatch-when-specifying-the-feat

Tags

python

matplotlib

plot

visualization

xgboost