How to get most informative features for scikit-learn classifiers?


Question:

The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:

viagra = None          ok : spam     =      4.5 : 1.0
hello = True           ok : spam     =      4.5 : 1.0
hello = None           spam : ok     =      3.3 : 1.0
viagra = True          spam : ok     =      3.3 : 1.0
casino = True          spam : ok     =      2.0 : 1.0
casino = None          ok : spam     =      1.5 : 1.0
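For reference, output like the above comes from nltk's Naive Bayes classifier; a minimal sketch of how it is produced (the feature dicts and labels below are made up):

import nltk

# Toy training data: (feature dict, label) pairs -- invented examples
train_set = [
    ({"viagra": True, "casino": True}, "spam"),
    ({"hello": True}, "ok"),
    ({"hello": True, "casino": False}, "ok"),
    ({"viagra": True}, "spam"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(6)  # prints a table like the one above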

My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything like it.

If there is no such function yet, does somebody know a workaround for getting those values?

Thanks a lot!

Answer 1:

The classifiers themselves do not record feature names; they only see numeric arrays. However, if you extracted your features with a vectorizer (CountVectorizer, TfidfVectorizer, or DictVectorizer) and you are using a linear model (e.g. LinearSVC or naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):

import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class."""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.
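For completeness, a rough usage sketch of the function above; the toy corpus and labels are invented, and note that on recent scikit-learn versions get_feature_names_out() replaces get_feature_names():

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented three-class toy corpus, just to show the call pattern
docs = ["cheap viagra and casino offers", "hello how are you today",
        "meeting agenda for tomorrow", "win money at the casino now",
        "hello friend long time no see", "please review the attached agenda"]
labels = ["spam", "personal", "work", "spam", "personal", "work"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

# clf.classes_ gives the label order matching the rows of clf.coef_
print_top10(vectorizer, clf, clf.classes_)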



Answer 2:

With the help of larsmans' code, I came up with this code for the binary case:

def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
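A quick usage sketch for the binary case; the toy data and the choice of LogisticRegression are just for illustration, and any linear model with a coef_ attribute works the same way:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented two-class toy data
docs = ["cheap viagra casino win money now", "hello how are you doing",
        "casino bonus offer click here", "see you at lunch tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ok

vectorizer = CountVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(docs), labels)

# Prints the n most negative and n most positive features side by side
show_most_informative_features(vectorizer, clf, n=5)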


Answer 3:

To add an update: RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you how much of the observed variance is explained by each feature; the values are normalized, so they sum to 1.

I find this attribute very useful when performing feature engineering.

Thanks to the scikit-learn team and contributors for implementing this!

edit: This works for both RandomForest and GradientBoosting. So RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support this.
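A minimal sketch of reading the attribute, using the iris dataset purely as a placeholder:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# Rank features from most to least important; the importances sum to 1
for idx in np.argsort(clf.feature_importances_)[::-1]:
    print("%-20s %.4f" % (data.feature_names[idx], clf.feature_importances_[idx]))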



Answer 4:

We've recently released a library, eli5 (https://github.com/TeamHG-Memex/eli5), which lets you do exactly that: it handles various classifiers from scikit-learn, covers binary and multiclass cases, can highlight text according to feature values, integrates with IPython, and more.
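A hedged sketch of the basic call, assuming clf and vectorizer are a fitted classifier and the vectorizer used to build its features (as in the earlier answers):

import eli5

# Renders an HTML summary of the top-weighted features when run in a notebook;
# eli5.explain_weights(clf, vec=vectorizer) returns the same data as a plain object.
eli5.show_weights(clf, vec=vectorizer, top=20)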



Answer 5:

RandomForestClassifier does not yet have a coef_ attribute, but it will in the 0.17 release, I think. However, see the RandomForestClassifierWithCoef class in Recursive feature elimination on Random Forest using scikit-learn. This may give you some ideas to work around the limitation above.
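That linked workaround is roughly the following subclass (a sketch, not the exact code), which copies feature_importances_ into coef_ so that tools expecting a coef_ attribute, such as RFE, can run on a random forest:

from sklearn.ensemble import RandomForestClassifier

class RandomForestClassifierWithCoef(RandomForestClassifier):
    """Random forest that also exposes a coef_ attribute after fitting."""
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        # Reuse the impurity-based importances wherever a linear coef_ is expected
        self.coef_ = self.feature_importances_
        return self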



Answer 6:

You can also do something like this to create a graph of feature importances, ordered by importance:

import numpy as np
import matplotlib.pyplot as plt

# train[features] is assumed to be the feature matrix used to fit clf
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(train[features].shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(train[features].shape[1]), indices)
plt.xlim([-1, train[features].shape[1]])
plt.show()

