Question:
The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:
viagra = None          ok : spam   =      4.5 : 1.0
hello = True           ok : spam   =      4.5 : 1.0
hello = None           spam : ok   =      3.3 : 1.0
viagra = True          spam : ok   =      3.3 : 1.0
casino = True          spam : ok   =      2.0 : 1.0
casino = None          ok : spam   =      1.5 : 1.0
My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything of the sort.
If there is no such function yet, does somebody know a workaround to get at those values?
Thanks a lot!
Answer 1:
The classifiers themselves do not record feature names; they just see numeric arrays. However, if you extracted your features using a Vectorizer / CountVectorizer / TfidfVectorizer / DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):
import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class."""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
                          " ".join(feature_names[j] for j in top10)))
This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.
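For context, here is a minimal, untested usage sketch of the function above; texts and labels are placeholders for your own training data, not anything defined in this answer:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# texts: list of documents, labels: list of class labels (placeholders)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = LinearSVC().fit(X, labels)

# clf.classes_ is already in the order matching the rows of clf.coef_,
# so it can serve as class_labels for the multiclass case
# (for the binary case, see the note above).
print_top10(vectorizer, clf, clf.classes_)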
Answer 2:
With the help of larsmans' code I came up with this code for the binary case:
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
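Each printed row pairs one of the n most negative coefficients (features pulling towards the first class) with one of the n most positive ones (features pulling towards the second class), so the most informative features for both classes are visible side by side.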
Answer 3:
To add an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you how much of the observed variance is explained by that feature. Obviously, the sum of all these values must be <= 1.
I find this attribute very useful when performing feature engineering.
Thanks to the scikit-learn team and contributors for implementing this!
Edit: This works for both RandomForest and GradientBoosting, so RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support this.
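As a quick illustration, a minimal sketch of reading the attribute (assuming a fitted RandomForestClassifier clf and a list feature_names of column names; both are placeholders, not objects defined in this answer):

import numpy as np

# Rank features by importance and print the top 10 with their scores.
importances = clf.feature_importances_
for idx in np.argsort(importances)[::-1][:10]:
    print("%-20s %.4f" % (feature_names[idx], importances[idx]))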
Answer 4:
We've recently released a library (https://github.com/TeamHG-Memex/eli5) which allows you to do that: it handles various classifiers from scikit-learn, binary/multiclass cases, allows highlighting text according to feature values, integrates with IPython, etc.
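As an illustration, a minimal sketch of the kind of call eli5 supports (based on the eli5 documentation; clf and vectorizer are assumed to be a fitted scikit-learn classifier and the vectorizer used to build its features):

import eli5

# In an IPython/Jupyter notebook this renders a table of per-class feature weights.
eli5.show_weights(clf, vec=vectorizer, top=20)

# Outside a notebook, the same information can be obtained as text.
explanation = eli5.explain_weights(clf, vec=vectorizer, top=20)
print(eli5.format_as_text(explanation))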
Answer 5:
RandomForestClassifier does not yet have a coef_ attribute, but it will in the 0.17 release, I think. However, see the RandomForestClassifierWithCoef class in Recursive feature elimination on Random Forest using scikit-learn. This may give you some ideas to work around the limitation above.
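The gist of that workaround is a subclass that copies feature_importances_ into coef_ after fitting, so code that expects a coef_ attribute (such as RFE) keeps working; a rough sketch along those lines:

from sklearn.ensemble import RandomForestClassifier

class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        # Fit the forest as usual, then expose the importances under the
        # attribute name that feature-selection utilities look for.
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        self.coef_ = self.feature_importances_
        return self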
Answer 6:
You can also do something like this to create a graph of feature importances, ordered by importance (this assumes a fitted forest clf and a feature matrix train[features]):
import numpy as np
import matplotlib.pyplot as plt

importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
# print("Feature ranking:")

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(train[features].shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(train[features].shape[1]), indices)
plt.xlim([-1, train[features].shape[1]])
plt.show()
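The yerr bars come from the standard deviation of each feature's importance across the individual trees in the forest, which gives a rough idea of how stable each feature's ranking is.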