feature-selection

Feature importances - Bagging, scikit-learn

半腔热情 submitted on 2019-12-22 04:52:28
Question: For a project I am comparing a number of decision tree ensembles, using the regression algorithms (Random Forest, Extra Trees, AdaBoost and Bagging) of scikit-learn. To compare and interpret them I use the feature importances, though for the bagging ensemble these do not appear to be available. My question: does anybody know how to get the feature importances list for Bagging? Greetings, Kornee Answer 1: Are you talking about BaggingClassifier? It can be used with many base estimators, so there is no…
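A minimal sketch of one common workaround, assuming the base estimators are decision trees that expose feature_importances_: average the per-estimator importances by hand. The dataset and variable names below are illustrative, not taken from the original post.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Toy regression data standing in for the poster's dataset.
X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Pass the base estimator positionally, which works across scikit-learn versions.
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)
bag.fit(X, y)

# The bagging ensemble itself has no feature_importances_, but each fitted
# tree does, so average the per-tree importances manually.
importances = np.mean([tree.feature_importances_ for tree in bag.estimators_], axis=0)
print(importances)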

Feature selection for multilabel classification (scikit-learn)

喜欢而已 submitted on 2019-12-21 06:04:13
Question: I'm trying to do feature selection with the chi-square method in scikit-learn (sklearn.feature_selection.SelectKBest). When I apply this to a multilabel problem, I get this warning: UserWarning: Duplicate scores. Result may depend on feature ordering. There are probably duplicate features, or you used a classification score for a regression task. warn("Duplicate scores. Result may depend on feature ordering." Why is it appearing, and how do I properly apply feature selection in this case…
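A minimal sketch of one possible way to apply the chi-square score to a multilabel target: score each feature against every binary label column separately and aggregate the scores (here with max). This is an assumed approach for illustration, not necessarily what the accepted answer proposes; the data is synthetic.

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import chi2

# Synthetic multilabel data; chi2 requires non-negative features, which
# make_multilabel_classification provides (count-valued X).
X, Y = make_multilabel_classification(n_samples=300, n_features=20, n_classes=4, random_state=0)

# Score each feature against each binary label column, then aggregate (here: max).
scores = np.array([chi2(X, Y[:, j])[0] for j in range(Y.shape[1])])
aggregated = scores.max(axis=0)

# Keep the 10 features with the highest aggregated chi2 score.
top_features = np.argsort(aggregated)[::-1][:10]
X_selected = X[:, top_features]
print(top_features)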

How can I get the relative importance of features of a logistic regression for a particular prediction?

放肆的年华 submitted on 2019-12-21 05:11:27
Question: I am using a Logistic Regression (in scikit-learn) for a binary classification problem, and am interested in being able to explain each individual prediction. To be more precise, I'm interested in predicting the probability of the positive class, and having a measure of the importance of each feature for that prediction. Using the coefficients (betas) as a measure of importance is generally a bad idea, as answered here, but I have yet to find a good alternative. So far the best I have found are the…
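A minimal sketch of one simple per-prediction decomposition: with standardized features, the log-odds of the positive class equals the intercept plus coef_i * x_i summed over features, so each term can be read as that feature's additive contribution to this particular prediction. The dataset is a stand-in and the approach is illustrative, not the one ultimately recommended in the thread.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy binary classification problem; names are illustrative only.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # put features on a comparable scale

clf = LogisticRegression(max_iter=1000).fit(X, y)

# For a single sample, the log-odds decomposes as intercept + sum_i coef_i * x_i,
# so each term is that feature's additive contribution to this prediction.
x = X[0]
contributions = clf.coef_[0] * x
log_odds = clf.intercept_[0] + contributions.sum()
prob_positive = 1.0 / (1.0 + np.exp(-log_odds))

# The five features that pushed this particular prediction the hardest.
top = np.argsort(np.abs(contributions))[::-1][:5]
print(prob_positive, top, contributions[top])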

Python's implementation of Mutual Information

假如想象 submitted on 2019-12-20 12:35:39
Question: I am having some issues implementing the mutual information function that Python's machine learning libraries provide, in particular: sklearn.metrics.mutual_info_score(labels_true, labels_pred, contingency=None) (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html). I am trying to implement the example from the Stanford NLP tutorial site, found here: http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html#mifeatsel2 The…
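A minimal sketch of computing mutual information from a contingency table with mutual_info_score, and converting the result from nats to bits: scikit-learn uses the natural logarithm, whereas the IR-book example uses log base 2, which is a common source of mismatches. The counts below are made up, not taken from the IR-book example.

import numpy as np
from sklearn.metrics import mutual_info_score

# A small, made-up 2x2 contingency table
# (term present/absent vs. document in class / not in class).
contingency = np.array([[10, 5],
                        [15, 70]])

# mutual_info_score can work directly from a contingency table;
# the label arguments are ignored when contingency is supplied.
mi_nats = mutual_info_score(None, None, contingency=contingency)

# scikit-learn returns MI in nats; textbook examples in bits differ by a factor of ln(2).
mi_bits = mi_nats / np.log(2)
print(mi_nats, mi_bits)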

Difference between varImp (caret) and importance (randomForest) for Random Forest

二次信任 submitted on 2019-12-20 12:32:00
Question: I do not understand the difference between the varImp function (caret package) and the importance function (randomForest package) for a Random Forest model. I computed a simple RF classification model and, when computing variable importance, I found that the "ranking" of predictors was not the same for both functions. Here is my code: rfImp <- randomForest(Origin ~ ., data = TAll_CS, ntree = 2000, importance = TRUE) importance(rfImp) BREAST LUNG MeanDecreaseAccuracy MeanDecreaseGini Energy…

Example for svm feature selection in R

天涯浪子 submitted on 2019-12-20 09:24:26
Question: I'm trying to apply feature selection (e.g. recursive feature selection) to an SVM, using an R package. I've installed Weka, which supports feature selection in LibSVM, but I haven't found any example of the syntax for SVM or anything similar. A short example would be of great help. Answer 1: The function rfe in the caret package performs recursive feature selection for various algorithms. Here's an example from the caret documentation: library(caret) data(BloodBrain, package="caret") x <- scale…

find important features for classification

我的梦境 submitted on 2019-12-20 09:23:34
Question: I'm trying to classify some EEG data using a logistic regression model (this seems to give the best classification of my data). The data I have is from a multichannel EEG setup, so in essence I have a matrix of 63 x 116 x 50, that is channels x time points x number of trials (there are two trial types of 50 each). I have reshaped this into one long vector per trial. What I would like to do, after the classification, is to see which features were the most useful in classifying the trials. How…
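A minimal sketch of the overall workflow described in the question, using simulated data with the stated dimensions (63 channels x 116 time points, 50 trials per type): reshape to one feature vector per trial, fit the logistic regression, then map the coefficient magnitudes back onto the channel x time grid. The data and threshold-free "top feature" readout are illustrative assumptions, not the thread's final answer.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for the EEG data: 63 channels x 116 time points x 100 trials.
rng = np.random.default_rng(0)
data = rng.normal(size=(63, 116, 100))
labels = np.repeat([0, 1], 50)  # two trial types, 50 each

# One row per trial, one column per (channel, time point) feature.
X = data.reshape(63 * 116, 100).T
X = StandardScaler().fit_transform(X)

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Map coefficient magnitudes back onto the channel x time grid to see
# which channels and time points the model leaned on most.
weight_map = np.abs(clf.coef_[0]).reshape(63, 116)
channel_idx, time_idx = np.unravel_index(weight_map.argmax(), weight_map.shape)
print(channel_idx, time_idx)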

How to use scikit-learn PCA for features reduction and know which features are discarded

安稳与你 submitted on 2019-12-20 08:49:22
Question: I am trying to run a PCA on a matrix of dimensions m x n, where m is the number of features and n the number of samples. Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way: from sklearn.decomposition import PCA nf = 100 pca = PCA(n_components=nf) # X is the matrix transposed (n samples on the rows, m features on the columns) pca.fit(X) X_new = pca.transform(X) Now, I get a new matrix X_new that has a shape of n x nf. Is it…
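A short follow-up sketch on the second half of the question: PCA does not discard individual features, since every component is a linear combination of all of them, but pca.components_ (shape n_components x n_features) shows how strongly each original feature loads on each retained component. The dataset and the "sum of absolute loadings" ranking below are just illustrative stand-ins.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Toy data with 64 pixel features, standing in for the question's matrix.
X, _ = load_digits(return_X_y=True)

pca = PCA(n_components=10).fit(X)
X_new = pca.transform(X)  # shape (n_samples, 10): combinations of features, not a subset

# components_ has shape (n_components, n_features); the absolute loadings show
# how much each original feature contributes to the retained components.
loadings = np.abs(pca.components_)
most_influential = loadings.sum(axis=0).argsort()[::-1][:10]
print(most_influential)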

Linear regression analysis with string/categorical features (variables)?

爱⌒轻易说出口 submitted on 2019-12-17 17:26:37
Question: Regression algorithms seem to work on features represented as numbers. For example, this dataset doesn't contain categorical features/variables, and it's quite clear how to do regression on this data and predict price. But now I want to do regression analysis on data that contains categorical features. There are 5 features: District, Condition, Material, Security, Type. How can I do regression on this data? Do I have to transform all this string/categorical data to numbers manually? I…
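A minimal sketch of one standard approach: one-hot encode the string columns with OneHotEncoder inside a ColumnTransformer, so no manual conversion to numbers is needed. The column names come from the question; the rows and prices are made up.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# A tiny made-up dataset using the column names mentioned in the question.
df = pd.DataFrame({
    "District": ["Center", "Suburb", "Center", "Suburb"],
    "Condition": ["good", "bad", "good", "good"],
    "Material": ["brick", "panel", "brick", "wood"],
    "Security": ["yes", "no", "no", "yes"],
    "Type": ["flat", "house", "flat", "house"],
    "price": [120000, 80000, 115000, 95000],
})

categorical = ["District", "Condition", "Material", "Security", "Type"]

# One-hot encode the string columns and feed the result to a linear regression.
model = make_pipeline(
    ColumnTransformer([("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)]),
    LinearRegression(),
)
model.fit(df[categorical], df["price"])
print(model.predict(df[categorical]))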

How are feature_importances in RandomForestClassifier determined?

ⅰ亾dé卋堺 submitted on 2019-12-17 06:57:10
Question: I have a classification task with a time series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result, I would like to find out which attributes/dates contribute to the result, and to what extent. Therefore I am just using feature_importances_, which works well for me. However, I would like to know how they are calculated and which measure/algorithm is used. Unfortunately I could not find any documentation on…
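A minimal sketch illustrating what scikit-learn's feature_importances_ is for a random forest: the impurity-based "mean decrease in impurity" measure, obtained by averaging the per-tree importances. The data is synthetic, standing in for the 23-attribute time series from the question.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data with 23 features, mirroring the number of attributes in the question.
X, y = make_classification(n_samples=300, n_features=23, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each tree records how much every feature's splits reduce impurity;
# the forest averages those per-tree importances (and normalizes them to sum to 1).
manual = np.mean([tree.feature_importances_ for tree in forest.estimators_], axis=0)
print(np.allclose(manual, forest.feature_importances_))  # expected: True, up to normalization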