feature-selection

Feature importances - Bagging, scikit-learn

 ̄綄美尐妖づ submitted on 2019-12-05 05:50:11
For a project I am comparing a number of decision-tree ensembles, using the regression algorithms (Random Forest, Extra Trees, AdaBoost and Bagging) of scikit-learn. To compare and interpret them I use the feature importances, though for the bagging ensemble this does not appear to be available. My question: does anybody know how to get the feature importances list for Bagging? Greetings, Kornee

Are you talking about BaggingClassifier? It can be used with many base estimators, so feature importances are not implemented. There are model-independent methods for computing feature importances (see e…
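One common workaround (my own sketch, not stated in the thread) is to average feature_importances_ over the fitted base estimators of the bagging ensemble, which works whenever the base estimator is a tree and exposes that attribute:

```python
# Sketch: averaging feature_importances_ across the base estimators of a
# BaggingRegressor. Assumes the base estimator exposes feature_importances_
# (true for the default DecisionTreeRegressor). Dataset is synthetic.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
bag = BaggingRegressor(n_estimators=20, random_state=0).fit(X, y)

# Each fitted tree carries its own importances; average over the ensemble.
importances = np.mean(
    [est.feature_importances_ for est in bag.estimators_], axis=0
)
print(importances)  # one value per feature, summing to 1
```

With the default max_features=1.0 every estimator sees all columns, so the per-tree importance vectors line up column-for-column; with column subsampling you would need to map each estimator's columns back via estimators_features_.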

Genetic algorithms: fitness function for feature selection algorithm

随声附和 submitted on 2019-12-05 03:16:21
I have a data set n x m where there are n observations and each observation consists of m values for m attributes. Each observation also has an observed result assigned to it. m is big, too big for my task. I am trying to find the best and smallest subset of the m attributes that still represents the whole dataset well, so that I can use only these attributes for training a neural network. I want to use a genetic algorithm for this. The problem is the fitness function. It should tell how well the generated model (subset of attributes) still reflects the original data, and I don't know how to…
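A common choice of fitness (an assumption on my part, not from the question) is the cross-validated score of a cheap surrogate model trained on the candidate subset, minus a small penalty per selected attribute so the GA also rewards small subsets. The penalty weight 0.01 below is purely illustrative:

```python
# Sketch of a GA fitness function for feature selection: score a candidate
# subset (boolean mask over the m attributes) by cross-validated R^2 of a
# cheap surrogate model (Ridge), minus a size penalty. Synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=120, n_features=20, n_informative=5,
                       random_state=0)

def fitness(mask):
    if not mask.any():                 # an empty subset is invalid
        return float("-inf")
    score = cross_val_score(Ridge(), X[:, mask], y, cv=3).mean()
    return score - 0.01 * mask.sum()   # reward accuracy, penalize size

print(fitness(np.ones(20, dtype=bool)))   # fitness of the full feature set
```

The surrogate does not have to be the final neural network; using a fast linear model keeps each generation cheap, at the cost of the subset being tuned to a proxy rather than the real model.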

Feature selection with caret rfe and training with another method

烈酒焚心 submitted on 2019-12-04 20:19:31
Right now, I'm trying to use caret's rfe function to perform feature selection, because I'm in a situation with p >> n and most regression techniques that don't involve some sort of regularisation can't be used well. I have already used a few techniques with regularisation (Lasso), but what I want to try now is to reduce my number of features so that I can run, at least decently, any kind of regression algorithm on it.

control <- rfeControl(functions=rfFuncs, method="cv", number=5)
model <- rfe(trainX, trainY, rfeControl=control)
predict(model, testX)

Right now, if I do it like this, a feature…
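For readers outside R, the same two-stage idea (select with one model, train another on the reduced matrix) can be sketched with scikit-learn's RFE — this is a Python analog of my own, not the caret code from the question:

```python
# Sketch: recursive feature elimination driven by a random forest's
# importances, then a different regressor fit on the selected columns.
# Dataset and feature counts are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=30, random_state=0)

selector = RFE(RandomForestRegressor(n_estimators=50, random_state=0),
               n_features_to_select=10).fit(X, y)
X_small = selector.transform(X)      # keep only the 10 surviving columns

final_model = LinearRegression().fit(X_small, y)
print(X_small.shape)
```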

R Caret's rfe [Error in { : task 1 failed - “rfe is expecting 184 importance values but only has 2”]

怎甘沉沦 submitted on 2019-12-04 12:57:09
I am using caret's rfe for a regression application. My data (in a data.table) has 176 predictors (including 49 factor predictors). When I run the function, I get this error:

Error in { : task 1 failed - "rfe is expecting 176 importance values but only has 2"

Then I used model.matrix(~ . - 1, data = as.data.frame(train_model_sell_single_bid)) to convert the factor predictors to dummy variables. However, I got a similar error:

Error in { : task 1 failed - "rfe is expecting 184 importance values but only has 2"

I'm using R version 3.1.1 on Windows 7 (64-bit), caret version 6.0-41. I also have…

Sklearn Chi2 For Feature Selection

风格不统一 submitted on 2019-12-04 07:36:57
I'm learning about chi2 for feature selection and came across code like this. However, my understanding of chi2 was that higher scores mean that the feature is more independent (and therefore less useful to the model), and so we would be interested in features with the lowest scores. However, using scikit-learn's SelectKBest, the selector returns the values with the highest chi2 scores. Is my understanding of the chi2 test incorrect? Or does the chi2 score in sklearn produce something other than a chi2 statistic? See code below for what I mean (mostly copied from the above link except for the…
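The short resolution: the chi2 statistic measures evidence of dependence between a feature and the class, so a higher score means a more useful feature, and keeping the highest scores is correct. A tiny demonstration (my own toy data) makes this concrete:

```python
# Sketch: chi2 rewards dependence between feature and label. Column 0
# tracks the class perfectly; column 1 is constant and independent of it.
# SelectKBest therefore keeps column 0, the higher-scoring feature.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([
    [0, 0, 0, 5, 5, 5],   # strongly associated with y
    [3, 3, 3, 3, 3, 3],   # identical in both classes
])

scores, _ = chi2(X, y)
kept = SelectKBest(chi2, k=1).fit(X, y).get_support(indices=True)
print(scores, kept)
```

The dependent column gets a large score and the constant column scores zero, which is why SelectKBest ranks from the top rather than the bottom.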

How to programmatically determine the column indices of principal components using FactoMineR package?

こ雲淡風輕ζ submitted on 2019-12-04 07:29:49
Given a data frame containing mixed variables (i.e. both categorical and continuous) like,

digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
                 studLoc=sample(createRandString(10)),
                 finalmark=sample(c(0:100),10),
                 subj1mark=sample(c(0:100),10), subj2mark=sample(c(0:100),10))

I perform unsupervised feature selection…

How do I SelectKBest using mutual information from a mixture of discrete and continuous features?

末鹿安然 submitted on 2019-12-04 04:02:14
I am using scikit-learn to train a classification model. I have both discrete and continuous features in my training data. I want to do feature selection using maximum mutual information. If I have vectors x and labels y and the first three feature values are discrete, I can get the MMI values like so: mutual_info_classif(x, y, discrete_features=[0, 1, 2]). Now I'd like to use the same mutual information selection in a pipeline. I'd like to do something like this: SelectKBest(score_func=mutual…
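The usual answer to this shape of problem (an assumption here, since the thread is truncated) is to bind the extra argument with functools.partial so the resulting callable has the plain (X, y) signature SelectKBest expects:

```python
# Sketch: fixing discrete_features via functools.partial so that
# mutual_info_classif can serve as a SelectKBest score_func.
# Data below is synthetic; the first three columns are discrete.
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

score_func = partial(mutual_info_classif,
                     discrete_features=[0, 1, 2], random_state=0)
selector = SelectKBest(score_func=score_func, k=2)

rng = np.random.RandomState(0)
X = np.column_stack([rng.randint(0, 3, 100) for _ in range(3)]
                    + [rng.rand(100)])
y = (X[:, 0] > 0).astype(int)        # label driven by the first column
X_new = selector.fit_transform(X, y)
print(X_new.shape)
```

The same partial-bound callable can then be dropped into a Pipeline step like any other SelectKBest instance.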

apache spark MLLib: how to build labeled points for string features?

耗尽温柔 submitted on 2019-12-03 22:52:12
I am trying to build a NaiveBayes classifier with Spark's MLlib which takes as input a set of documents. I'd like to put some things as features (i.e. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e. it looks like LabeledPoint[Double, List[Pair[Double,Double]]]. Instead, what I have as output from the rest of my code would be something like LabeledPoint[Double, List[Pair[String,Double]]]. I could make up my own conversion, but it seems odd. How am I supposed to handle this using MLlib? I believe the…
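The standard route is to map each string feature to a numeric index, either with an explicit vocabulary or with the hashing trick (which is what Spark's HashingTF does). A plain-Python sketch of the hashing variant, with the bucket count chosen arbitrarily for illustration:

```python
# Sketch of the hashing trick: turn (string, weight) pairs into a
# {index: weight} sparse vector of doubles, the shape LabeledPoint needs.
# N_BUCKETS is an illustrative size; collisions merge additively.
N_BUCKETS = 1024

def featurize(pairs):
    """Hash (string, weight) pairs into a fixed-size index space."""
    vec = {}
    for name, weight in pairs:
        idx = hash(name) % N_BUCKETS   # note: Python's hash varies per run
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec

doc = [("author:smith", 1.0), ("tag:spark", 1.0), ("kw:mllib", 0.5)]
print(featurize(doc))
```

In Spark itself you would use a deterministic hash (HashingTF) or a fitted StringIndexer rather than Python's process-randomized hash, but the conversion being performed is the same.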

Feature selection for multilabel classification (scikit-learn)

…衆ロ難τιáo~ submitted on 2019-12-03 20:30:36
I'm trying to do feature selection by the chi-square method in scikit-learn (sklearn.feature_selection.SelectKBest). When I try to apply this to a multilabel problem, I get this warning:

UserWarning: Duplicate scores. Result may depend on feature ordering. There are probably duplicate features, or you used a classification score for a regression task.
warn("Duplicate scores. Result may depend on feature ordering."

Why is it appearing, and how do I properly apply feature selection in this case?

The code warns you that arbitrary tie-breaking may need to be performed because some features have…

How can I get the relative importance of features of a logistic regression for a particular prediction?

醉酒当歌 submitted on 2019-12-03 15:14:34
I am using a Logistic Regression (in scikit) for a binary classification problem, and am interested in being able to explain each individual prediction. To be more precise, I'm interested in predicting the probability of the positive class, and having a measure of the importance of each feature for that prediction. Using the coefficients (betas) as a measure of importance is generally a bad idea as answered here, but I'm yet to find a good alternative. So far the best I have found are the following 3 options:

Monte Carlo option: fixing all other features, re-run the prediction replacing the…
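One further option worth noting (my addition, not one of the three from the question): because logistic regression is linear in the log-odds, the decision value for one sample decomposes exactly into intercept + sum of coef_j * x_j, so coef_j * x_j is feature j's additive contribution to that particular prediction:

```python
# Sketch: per-sample log-odds decomposition for logistic regression.
# Each term coef_j * x_j is feature j's additive contribution to the
# log-odds of the positive class for that one sample. Synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

x = X[0]
contributions = clf.coef_[0] * x                 # one term per feature
log_odds = clf.intercept_[0] + contributions.sum()
prob = 1.0 / (1.0 + np.exp(-log_odds))           # sigmoid of the log-odds
print(contributions, prob)
```

The decomposition is additive in log-odds, not in probability, and (like the raw betas) it is scale-sensitive, so it is most interpretable on standardized features.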