How to get feature importance in xgboost?

后端 未结 11 1774
情深已故
情深已故 2020-12-13 07:00

I\'m using xgboost to build a model, and try to find the importance of each feature using get_fscore(), but it returns {}

and my train code

11条回答
  •  半阙折子戏
    2020-12-13 07:29

    According to this post there 3 different ways to get feature importance from Xgboost:

    • use built-in feature importance,
    • use permutation based importance,
    • use shap based importance.

    Built-in feature importance

    Code example:

    xgb = XGBRegressor(n_estimators=100)
    xgb.fit(X_train, y_train)
    sorted_idx = xgb.feature_importances_.argsort()
    plt.barh(boston.feature_names[sorted_idx], xgb.feature_importances_[sorted_idx])
    plt.xlabel("Xgboost Feature Importance")
    

    Please be aware of what type of feature importance you are using. There are several types of importance, see the docs. The scikit-learn like API of Xgboost is returning gain importance while get_fscore returns weight type.

    Permutation based importance

    perm_importance = permutation_importance(xgb, X_test, y_test)
    sorted_idx = perm_importance.importances_mean.argsort()
    plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
    plt.xlabel("Permutation Importance")
    

    This is my preferred way to compute the importance. However, it can fail in case highly colinear features, so be careful! It's using permutation_importance from scikit-learn.

    SHAP based importance

    explainer = shap.TreeExplainer(xgb)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test, plot_type="bar")
    

    To use the above code, you need to have shap package installed.

    I was running the example analysis on Boston data (house price regression from scikit-learn). Below 3 feature importance:

    Built-in importance

    Permutation based importance

    SHAP importance

    All plots are for the same model! As you see, there is a difference in the results. I prefer permutation-based importance because I have a clear picture of which feature impacts the performance of the model (if there is no high collinearity).

提交回复
热议问题