Determine most important feature per class

Submitted by 倖福魔咒の on 2020-12-07 18:23:23

Question


Imagine a machine learning problem where you have 20 classes and about 7000 sparse boolean features.

I want to figure out the 20 most distinctive features per class. In other words, features that are used a lot in one specific class but rarely or never in the other classes.

What would be a good feature selection algorithm or heuristic that can do this?


Answer 1:


When you train a multi-class logistic regression classifier, the trained model is a num_class x num_feature weight matrix, where the [i, j] entry is the weight of feature j in class i. The feature indices are the same as the columns of your input feature matrix.

In scikit-learn you can access the parameters of the fitted model, so if you use a scikit-learn classification algorithm you can find the most important features per class like this:

import numpy as np
from sklearn.linear_model import SGDClassifier

# 'log_loss' was called 'log' in older scikit-learn; regul is your regularization strength
clf = SGDClassifier(loss='log_loss', alpha=regul, penalty='l1', l1_ratio=0.9, learning_rate='optimal',
                    max_iter=10, shuffle=False, n_jobs=3, fit_intercept=True)
clf.fit(X_train, Y_train)
for i in range(clf.coef_.shape[0]):
    top20_indices = np.argsort(clf.coef_[i])[-20:]  # indices of the 20 largest weights for class i
    print(top20_indices)

clf.coef_ is the matrix containing the weight of each feature in each class, so clf.coef_[0][2] is the weight of the third feature in the first class. If, when you build your feature matrix, you keep track of each feature's index in a dictionary where dic[id] = feature_name, you can use that dictionary to retrieve the names of the top features.
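For example, here is a minimal sketch of that lookup; the feature_names dictionary is an assumption, something you would build yourself while constructing the feature matrix:

# feature_names: dict mapping column index -> feature name, built during vectorization
for i in range(clf.coef_.shape[0]):
    top20_indices = np.argsort(clf.coef_[i])[-20:][::-1]  # strongest weights first
    top20_names = [feature_names[j] for j in top20_indices]
    print("class %d: %s" % (i, top20_names))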

For more information, refer to the scikit-learn text classification example.




Answer 2:


Random Forest and Naive Bayes should be able to handle this for you. Given the sparsity, I'd try Naive Bayes first; Random Forest would be better if you're looking for combinations of features.
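As a rough sketch of the Naive Bayes route in scikit-learn (not part of the original answer; X_train, Y_train and feature indexing are assumed to be the same as above), you can rank features by how much more likely they are within one class than in the others, which matches the "used a lot here, hardly elsewhere" criterion:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

nb = BernoulliNB()
nb.fit(X_train, Y_train)  # X_train is the sparse boolean feature matrix

log_prob = nb.feature_log_prob_  # shape (num_class, num_feature): log P(feature | class)
for i, cls in enumerate(nb.classes_):
    # score each feature by its log-probability in this class minus its average in the other classes
    others = np.delete(log_prob, i, axis=0).mean(axis=0)
    distinctiveness = log_prob[i] - others
    top20 = np.argsort(distinctiveness)[-20:][::-1]
    print(cls, top20)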



Source: https://stackoverflow.com/questions/33118361/determine-most-important-feature-per-class
