Question
I'm using xgboost to build a model and trying to find the importance of each feature using get_fscore(), but it returns {}.
My training code is:
import xgboost as xgb

dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)
So is there any mistake in my training code? How do I get feature importance in xgboost?
Answer 1:
In your code you can get feature importance for each feature in dict form:
bst.get_score(importance_type='gain')
>>{'ftr_col1': 77.21064539577829,
'ftr_col2': 10.28690566363971,
'ftr_col3': 24.225014841466294,
'ftr_col4': 11.234086283060112}
Explanation: get_score() on the Booster returned by the train() API is defined as:
get_score(fmap='', importance_type='weight')
- fmap (str (optional)) – The name of feature map file.
- importance_type
- ‘weight’ - the number of times a feature is used to split the data across all trees.
- ‘gain’ - the average gain across all splits the feature is used in.
- ‘cover’ - the average coverage across all splits the feature is used in.
- ‘total_gain’ - the total gain across all splits the feature is used in.
- ‘total_cover’ - the total coverage across all splits the feature is used in.
https://xgboost.readthedocs.io/en/latest/python/python_api.html
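As a quick illustration (not part of the original answer), assuming bst is the Booster trained in the question, you can compare how the different importance_type values rank the same features (the total_* types need a reasonably recent xgboost):

for imp_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    scores = bst.get_score(importance_type=imp_type)  # dict of {feature_name: score}
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(imp_type, top)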
Answer 2:
Using the sklearn API with XGBoost >= 0.81:
clf.get_booster().get_score(importance_type="gain")
or
regr.get_booster().get_score(importance_type="gain")
For this to work correctly, when you call regr.fit (or clf.fit), X must be a pandas.DataFrame.
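A minimal self-contained sketch of that pattern (the toy data and the column names f0..f3 are placeholders, not from the original answer):

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

# Toy data; using a DataFrame lets the booster keep the column names.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=['f0', 'f1', 'f2', 'f3'])
y = 2 * X['f0'] + rng.normal(size=100)

regr = XGBRegressor(n_estimators=50)
regr.fit(X, y)

# Importance keyed by the DataFrame column names
print(regr.get_booster().get_score(importance_type="gain"))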
Answer 3:
Try this:
fscore = clf.best_estimator_.booster().get_fscore()
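For context, a hedged sketch of where clf.best_estimator_ typically comes from (a fitted GridSearchCV); note that in xgboost >= 0.81 the accessor is get_booster() rather than booster():

import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Toy data just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

clf = GridSearchCV(XGBClassifier(), {'max_depth': [3, 6]}, cv=3)
clf.fit(X, y)

# Older xgboost: clf.best_estimator_.booster().get_fscore()
fscore = clf.best_estimator_.get_booster().get_fscore()
print(fscore)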
Answer 4:
I don't know how to get the values themselves, but there is a good way to plot feature importance:
import matplotlib.pyplot as plt

model = xgb.train(params, d_train, 1000, watchlist)
fig, ax = plt.subplots(figsize=(12, 18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
Answer 5:
For feature importance, try this:
Classification:
pd.DataFrame(list(bst.get_fscore().items()), columns=['feature', 'importance']).sort_values('importance', ascending=False)
Regression:
xgb.plot_importance(bst)
Answer 6:
Build the model with XGBoost first:
import numpy as np
from matplotlib import pyplot
from xgboost import XGBClassifier, plot_importance

model = XGBClassifier()
model.fit(train, label)
The fitted model exposes feature_importances_ as an array, so we can sort the indices in descending order:
sorted_idx = np.argsort(model.feature_importances_)[::-1]
Then print each sorted importance together with its column name (this assumes the data was loaded with pandas):
for index in sorted_idx:
    print([train.columns[index], model.feature_importances_[index]])
Furthermore, we can plot the importances with XGBoost's built-in function:
plot_importance(model, max_num_features = 15)
pyplot.show()
Use max_num_features in plot_importance to limit the number of features if you want.
Answer 7:
For anyone who comes across this issue while using xgb.XGBRegressor(), the workaround I'm using is to keep the data in a pandas.DataFrame() or numpy.array() and not to convert it to a DMatrix(). Also, I had to make sure the gamma parameter is not specified for the XGBRegressor.
fit = alg.fit(dtrain[ft_cols].values, dtrain['y'].values)
ft_weights = pd.DataFrame(fit.feature_importances_, columns=['weights'], index=ft_cols)
After fitting the regressor, fit.feature_importances_ returns an array of weights which I'm assuming is in the same order as the feature columns of the pandas dataframe.
My current setup is Ubuntu 16.04, Anaconda distro, Python 3.6, xgboost 0.6, and scikit-learn 0.18.1.
Answer 8:
Get the table containing scores and feature names, and then plot it.
import pandas as pd
import matplotlib.pyplot as plt

feature_important = model.get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.plot(kind='barh')
plt.show()
Answer 9:
import matplotlib.pyplot as plt

print(model.feature_importances_)
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
Source: https://stackoverflow.com/questions/37627923/how-to-get-feature-importance-in-xgboost