feature-selection

How does sklearn random forest index feature_importances_

微笑、不失礼 submitted on 2019-12-03 13:48:18
I have used the RandomForestClassifier in sklearn to determine the important features in my dataset. How am I able to return the actual feature names (my variables are labeled x1, x2, x3, etc.) rather than their indices (it tells me the important features are '12', '22', etc.)? Below is the code that I am currently using to return the important features.

important_features = []
for x, i in enumerate(rf.feature_importances_):
    if i > np.average(rf.feature_importances_):
        important_features.append(str(x))
print important_features

Additionally, in an effort to understand the indexing, I was
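Since feature_importances_ is just an array ordered by column position, the index-to-name mapping comes from whatever column order was used at fit time. A minimal sketch of one common way to recover the names, assuming the training data was a pandas DataFrame with named columns (the df and y below are synthetic stand-ins for the asker's data):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with named columns.
df = pd.DataFrame(np.random.rand(100, 5), columns=['x1', 'x2', 'x3', 'x4', 'x5'])
y = np.random.randint(0, 2, size=100)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(df, y)

# Pair each importance with its column name; keep the above-average ones.
threshold = np.average(rf.feature_importances_)
important_features = [name for name, score in zip(df.columns, rf.feature_importances_)
                      if score > threshold]
print(important_features)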

scikit learn - feature importance calculation in decision trees

一曲冷凌霜 submitted on 2019-12-03 09:27:23
Question: I'm trying to understand how feature importance is calculated for decision trees in scikit-learn. This question has been asked before, but I am unable to reproduce the results the algorithm is providing. For example:

from StringIO import StringIO
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree.export import export_graphviz
from sklearn.feature_selection import mutual_info_classif

X = [[1,0,0], [0,0,0], [0,0,1], [0,1,0]]
y = [1,0,1,1]
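For reference, scikit-learn's impurity-based importance of a feature is the sum, over all nodes that split on that feature, of the weighted impurity decrease (node sample weight times node impurity, minus the same quantity for the two children), normalized so the importances sum to one. A minimal sketch reproducing this by hand from the public tree_ arrays on the toy data above; the attribute names used here are assumed to match the installed sklearn version:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 0, 1, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

tree = clf.tree_
importances = np.zeros(len(X[0]))
total = tree.weighted_n_node_samples[0]

for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:        # leaf: no split, contributes nothing
        continue
    decrease = (tree.weighted_n_node_samples[node] * tree.impurity[node]
                - tree.weighted_n_node_samples[left] * tree.impurity[left]
                - tree.weighted_n_node_samples[right] * tree.impurity[right])
    importances[tree.feature[node]] += decrease / total

importances /= importances.sum()      # normalize, as feature_importances_ does
print(importances)                    # should match clf.feature_importances_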

Perform Chi-2 feature selection on TF and TF*IDF vectors

若如初见. submitted on 2019-12-03 05:13:35
Question: I'm experimenting with Chi-2 feature selection for some text classification tasks. I understand that the Chi-2 test checks the dependence between two categorical variables, so if we perform Chi-2 feature selection for a binary text classification problem with a binary BOW vector representation, each Chi-2 test on each (feature, class) pair would be a very straightforward Chi-2 test with 1 degree of freedom. Quoting from the documentation: http://scikit-learn.org/stable/modules/generated/sklearn
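For context, a minimal sketch of how chi-squared feature selection is typically wired up on TF or TF*IDF matrices with SelectKBest; the corpus and labels here are made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["good movie", "bad movie", "great plot", "terrible acting"]   # toy corpus
labels = [1, 0, 1, 0]                                                 # binary classes

tf = CountVectorizer().fit_transform(docs)        # term-frequency (BOW) matrix
tfidf = TfidfTransformer().fit_transform(tf)      # TF*IDF weighting of the same matrix

selector = SelectKBest(chi2, k=3)
tf_selected = selector.fit_transform(tf, labels)  # works the same way on tfidf
print(selector.scores_)                           # one chi2 statistic per feature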

Understanding the `ngram_range` argument in a CountVectorizer in sklearn

十年热恋 submitted on 2019-12-03 04:14:54
Question: I'm a little confused about how to use ngrams in the scikit-learn library in Python, specifically, how the ngram_range argument works in a CountVectorizer. Running this code:

from sklearn.feature_extraction.text import CountVectorizer
vocabulary = ['hi ', 'bye', 'run away']
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
print cv.vocabulary_

gives me:

{'hi ': 0, 'bye': 1, 'run away': 2}

Where I was under the (obviously mistaken) impression that I would get unigrams and bigrams
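A user-supplied vocabulary fixes the feature set, so ngram_range only controls which n-grams of the input can be matched against it; dropping the vocabulary lets the vectorizer learn unigrams and bigrams itself. A small sketch contrasting the two behaviors, with a made-up corpus string:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["hi bye run away"]    # toy document, just for illustration

# Without a fixed vocabulary, ngram_range=(1, 2) extracts unigrams and bigrams.
cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(corpus)
print(sorted(cv.vocabulary_))
# ['away', 'bye', 'bye run', 'hi', 'hi bye', 'run', 'run away']

# With a fixed vocabulary, the feature set is exactly the supplied terms,
# regardless of ngram_range.
cv_fixed = CountVectorizer(vocabulary=['hi ', 'bye', 'run away'], ngram_range=(1, 2))
cv_fixed.fit(corpus)
print(sorted(cv_fixed.vocabulary_))
# ['bye', 'hi ', 'run away']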

Python's implementation of Mutual Information

自闭症网瘾萝莉.ら submitted on 2019-12-03 02:52:15
I am having some issues implementing the Mutual Information function that Python's machine learning libraries provide, in particular sklearn.metrics.mutual_info_score(labels_true, labels_pred, contingency=None) ( http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html ). I am trying to implement the example I found on the Stanford NLP tutorial site: http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html#mifeatsel2 The problem is I keep getting different results, without figuring out the reason yet. I get the concept
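One detail that often explains such mismatches is the logarithm base: mutual_info_score returns the value in nats (natural log), while many textbook examples report it in bits (log base 2). A minimal sketch comparing the library call against a manual log2 computation, using a made-up contingency table rather than the Stanford book's actual counts:

import numpy as np
from sklearn.metrics import mutual_info_score

# Hypothetical term/class contingency table (rows: term absent/present,
# columns: class absent/present) -- illustrative numbers only.
contingency = np.array([[774.0, 27.0],
                        [141.0, 49.0]])

# Library value: a precomputed contingency table can be passed directly;
# the labels arguments are then ignored. Result is in nats.
mi_nats = mutual_info_score(None, None, contingency=contingency)

# Manual computation in bits (log base 2), as in most textbook examples.
N = contingency.sum()
p_xy = contingency / N
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
mask = p_xy > 0
mi_bits = (p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum()

print(mi_nats, mi_bits, mi_nats / np.log(2))   # last two values should agree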

Difference between varImp (caret) and importance (randomForest) for Random Forest

柔情痞子 submitted on 2019-12-03 02:36:21
I do not understand what the difference is between the varImp function (caret package) and the importance function (randomForest package) for a Random Forest model. I computed a simple RF classification model and, when computing variable importance, I found that the "ranking" of predictors was not the same for both functions. Here is my code:

rfImp <- randomForest(Origin ~ ., data = TAll_CS, ntree = 2000, importance = TRUE)
importance(rfImp)

                            BREAST      LUNG MeanDecreaseAccuracy MeanDecreaseGini
Energy_GLCM_R1SC4NG3   -1.44116806 2.8918537            1.0929302        0.3712622
Contrast_GLCM_R1SC4NG3 -2.61146974 1.5848150                   -0

Correlated features and classification accuracy

喜夏-厌秋 submitted on 2019-12-03 01:21:43
Question: I'd like to ask everyone a question about how correlated features (variables) affect the classification accuracy of machine learning algorithms. By correlated features I mean a correlation between the features themselves and not with the target class (e.g. the perimeter and the area of a geometric figure, or the level of education and the average income). In my opinion correlated features negatively affect the accuracy of a classification algorithm, I'd say because the correlation makes one of them useless. Is
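As a quick empirical check (a toy sketch on synthetic data, not a general answer), one can duplicate a feature and compare cross-validated accuracy with and without the redundant copy:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Append a near-copy of the first feature -> a strongly correlated pair.
rng = np.random.RandomState(0)
X_corr = np.hstack([X, X[:, [0]] + 0.01 * rng.randn(500, 1)])

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())       # baseline accuracy
print(cross_val_score(clf, X_corr, y, cv=5).mean())  # with the correlated copy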

scikit learn - feature importance calculation in decision trees

蓝咒 submitted on 2019-12-02 23:41:46
I'm trying to understand how feature importance is calculated for decision trees in scikit-learn. This question has been asked before, but I am unable to reproduce the results the algorithm is providing. For example:

from StringIO import StringIO
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree.export import export_graphviz
from sklearn.feature_selection import mutual_info_classif

X = [[1,0,0], [0,0,0], [0,0,1], [0,1,0]]
y = [1,0,1,1]
clf = DecisionTreeClassifier()
clf.fit(X, y)
feat_importance = clf.tree_.compute_feature_importances

Reshape error when using mutual_info regression for feature selection

跟風遠走 submitted on 2019-12-02 20:19:11
Question: I am trying to do some feature selection using mutual_info_regression with the SelectKBest wrapper. However, I keep running into an error indicating that my list of features needs to be reshaped into a 2D array, and I'm not quite sure why I keep getting this message.

# feature selection before linear regression benchmark test
import sklearn
from sklearn.feature_selection import mutual_info_regression, SelectKBest
features = list(housing_data[housing_data.columns.difference(['sale_price'])])
target = 'sale
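That reshape error typically means a 1-D array (or a plain Python list) was passed where SelectKBest expects a 2-D feature matrix of shape (n_samples, n_features). A sketch of the usual fix, passing the DataFrame subset itself rather than a list of column names; the housing_data columns below are hypothetical stand-ins since the original DataFrame isn't shown:

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Hypothetical stand-in for housing_data; the real column names are unknown.
housing_data = pd.DataFrame({
    'lot_area': np.random.rand(50),
    'year_built': np.random.rand(50),
    'gr_liv_area': np.random.rand(50),
    'sale_price': np.random.rand(50),
})

feature_cols = housing_data.columns.difference(['sale_price'])
X = housing_data[feature_cols]      # 2-D matrix: (n_samples, n_features)
y = housing_data['sale_price']      # 1-D target is fine

selector = SelectKBest(mutual_info_regression, k=2)
X_selected = selector.fit_transform(X, y)
print(list(feature_cols[selector.get_support()]))   # names of the kept features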

Example for svm feature selection in R

∥☆過路亽.° submitted on 2019-12-02 18:36:57
I'm trying to apply feature selection (e.g. recursive feature selection) to an SVM, using R. I've installed Weka, which supports feature selection via LibSVM, but I haven't found any example of the syntax for SVM or anything similar. A short example would be of great help. The function rfe in the caret package performs recursive feature selection for various algorithms. Here's an example from the caret documentation:

library(caret)
data(BloodBrain, package="caret")
x <- scale(bbbDescr[,-nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), .8)]
x <- as.data.frame(x)
svmProfile <-