feature-selection

What is the optimal way to choose a set of features for excluding items based on a bitmask when matching against a large set?

Submitted by 旧巷老猫 on 2019-12-08 07:26:19
Question: Suppose I have a large, static set of objects, and I have an object that I want to match against all of them according to a complicated set of criteria that entails an expensive test. Suppose also that it's possible to identify a large set of features that can be used to exclude potential matches, thereby avoiding the expensive test. If a feature is present in the object I am testing, then I can exclude any objects in the set that don't have this feature. In other words, the presence of the …
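A minimal Python sketch of the kind of prefilter the question describes (the feature names, mask layout, and the expensive_match stub are illustrative assumptions, not part of the original question): each object carries a precomputed bitmask of its features, and a candidate is skipped whenever the probe has a feature bit the candidate lacks.

```python
# Hypothetical sketch of bitmask pre-filtering before an expensive match test.
FEATURES = ["red", "round", "heavy", "magnetic"]            # illustrative feature universe
FEATURE_BIT = {name: 1 << i for i, name in enumerate(FEATURES)}

def mask_of(features):
    """Pack a collection of feature names into an integer bitmask."""
    m = 0
    for f in features:
        m |= FEATURE_BIT[f]
    return m

def candidate_can_match(probe_mask, candidate_mask):
    """A candidate can only match if it has every feature the probe has."""
    return (candidate_mask & probe_mask) == probe_mask

def find_matches(probe_features, candidates, expensive_match):
    """candidates: iterable of (object, precomputed bitmask) pairs."""
    probe_mask = mask_of(probe_features)
    for obj, obj_mask in candidates:
        if not candidate_can_match(probe_mask, obj_mask):
            continue                                         # cheap exclusion, skip the expensive test
        if expensive_match(probe_features, obj):
            yield obj

# Tiny usage example: 'brick' is excluded cheaply because it lacks the 'round' bit.
candidates = [("ball", mask_of(["red", "round"])), ("brick", mask_of(["red", "heavy"]))]
print(list(find_matches(["red", "round"], candidates, lambda probe, obj: True)))   # -> ['ball']
```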

How does SelectKBest (chi2) calculate the score?

Submitted by 泄露秘密 on 2019-12-08 05:33:45
Question: I am trying to find the most valuable features by applying feature selection methods to my dataset. I'm using the SelectKBest function for now. I can generate the score values and sort them as I want, but I don't understand exactly how this score value is calculated. I know that, in theory, a higher score is more valuable, but I need a mathematical formula or a worked example in order to understand it properly. bestfeatures = SelectKBest(score_func=chi2, k=10) fit = bestfeatures.fit …
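For reference, the score that SelectKBest(score_func=chi2) ranks by can be reproduced by hand. A small sketch with made-up data (the computation below mirrors what sklearn.feature_selection.chi2 does for non-negative features): for each feature, the observed values are the per-class sums of that feature, the expected values split the feature's grand total according to the class frequencies, and the score is the familiar sum of (observed - expected)^2 / expected.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Tiny made-up example: 6 samples, 3 non-negative features, 2 classes.
X = np.array([[1, 0, 3],
              [2, 1, 0],
              [0, 0, 4],
              [3, 2, 0],
              [1, 0, 5],
              [2, 3, 1]], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1])

def chi2_by_hand(X, y):
    classes, counts = np.unique(y, return_counts=True)
    # observed[k, j] = total of feature j over the samples of class k
    observed = np.array([X[y == k].sum(axis=0) for k in classes])
    # expected[k, j] = grand total of feature j, split in proportion to class frequencies
    class_prob = counts / len(y)
    expected = np.outer(class_prob, X.sum(axis=0))
    return ((observed - expected) ** 2 / expected).sum(axis=0)

print(chi2_by_hand(X, y))   # should agree with ...
print(chi2(X, y)[0])        # ... the scores SelectKBest(score_func=chi2) ranks by
```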

Is there a way to use recursive feature selection with non-linear models with scikit-learn?

Submitted by 允我心安 on 2019-12-08 03:53:00
Question: I am trying to use SVR with an RBF kernel (obviously) on a regression problem. My dataset has something like 300 features. I would like to select the more relevant features and use something like MATLAB's sequentialfs function, which tries every combination (or, at least, starts with a few variables and adds variables along the way, or goes in the opposite direction, backward, like scikit-learn's RFE or RFECV). Now, as said, Python has RFE, but I can't use it with a non-linear estimator. Is …
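One workaround worth sketching (an assumption on my part, not necessarily what the poster ended up using): recent scikit-learn releases ship SequentialFeatureSelector, which is the closest analogue of MATLAB's sequentialfs and works with any estimator, including an RBF SVR, because it scores candidate feature subsets by cross-validation instead of relying on coefficients. The target subset size below is illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, random_state=0)

# Forward selection: greedily add the feature whose inclusion most improves the CV score.
selector = SequentialFeatureSelector(
    SVR(kernel="rbf", C=1.0),
    n_features_to_select=10,             # illustrative target size
    direction="forward",                 # or "backward", closer in spirit to RFE
    scoring="neg_mean_squared_error",
    cv=5,
)
selector.fit(StandardScaler().fit_transform(X), y)
print(selector.get_support(indices=True))   # indices of the selected features
```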

R Caret's rfe [Error in { : task 1 failed - “rfe is expecting 184 importance values but only has 2”]

Submitted by 江枫思渺然 on 2019-12-06 07:18:30
Question: I am using Caret's rfe for a regression application. My data (in a data.table) has 176 predictors (including 49 factor predictors). When I run the function, I get this error: Error in { : task 1 failed - "rfe is expecting 176 importance values but only has 2". Then I used model.matrix( ~ . - 1, data = as.data.frame(train_model_sell_single_bid)) to convert the factor predictors to dummy variables. However, I got a similar error: Error in { : task 1 failed - "rfe is expecting 184 importance values …

Sklearn Chi2 For Feature Selection

Submitted by 岁酱吖の on 2019-12-06 04:02:59
Question: I'm learning about chi2 for feature selection and came across code like this. However, my understanding of chi2 was that higher scores mean that the feature is more independent (and therefore less useful to the model), and so we would be interested in the features with the lowest scores. However, using scikit-learn's SelectKBest, the selector returns the values with the highest chi2 scores. Is my understanding of the chi2 test incorrect? Or does the chi2 score in sklearn produce something …
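A small illustration with made-up data of why the highest scores are the ones to keep: the chi2 statistic measures how far a feature's per-class totals deviate from what independence between feature and label would predict, so a feature that tracks the label gets a large score (and a tiny p-value rejecting independence), while a noise feature gets a much smaller score.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)

informative = np.where(y == 1, 5, 0) + rng.integers(0, 2, size=200)   # tracks the label
noise = rng.integers(0, 6, size=200)                                  # ignores the label

scores, pvalues = chi2(np.column_stack([informative, noise]), y)
print(scores)    # informative feature: large chi2; noise feature: much smaller
print(pvalues)   # informative feature: p ~ 0 (dependence); noise feature: large p-value
```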

How to programmatically determine the column indices of principal components using FactoMineR package?

Submitted by ぃ、小莉子 on 2019-12-06 03:40:17
Question: Given a data frame containing mixed variables (i.e. both categorical and continuous), like: digits = 0:9 # set seed for reproducibility set.seed(17) # function to create random string createRandString <- function(n = 5000) { a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE)) paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE)) } df <- data.frame(ID=c(1:10), name=sample(letters[1:10]), studLoc=sample(createRandString(10)), finalmark=sample(c(0:100),10), …

Putting together sklearn pipeline+nested cross-validation for KNN regression

Submitted by 与世无争的帅哥 on 2019-12-06 01:38:47
Question: I'm trying to figure out how to build a workflow for sklearn.neighbors.KNeighborsRegressor that includes: normalize features; feature selection (best subset of 20 numeric features, no specific total); cross-validate the hyperparameter K in the range 1 to 20; cross-validate the model; use RMSE as the error metric. There are so many different options in scikit-learn that I'm a bit overwhelmed trying to decide which classes I need. Besides sklearn.neighbors.KNeighborsRegressor, I think I need: sklearn.pipeline.Pipeline, sklearn.preprocessing.Normalizer, sklearn.model_selection.GridSearchCV, sklearn.model_selection …
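One plausible way to wire these pieces together is sketched below. It is an assumption-laden sketch rather than the accepted answer: StandardScaler and SelectKBest(f_regression, k=20) stand in for "normalize features" and "best subset of 20 numeric features" (the poster mentions Normalizer, which scales rows rather than columns), and the 'neg_root_mean_squared_error' scorer requires a reasonably recent scikit-learn. The inner GridSearchCV tunes K; the outer cross_val_score gives a nested-CV estimate of the tuned pipeline's RMSE.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=40, n_informative=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                    # normalize features
    ("select", SelectKBest(f_regression, k=20)),    # keep the best 20 numeric features
    ("knn", KNeighborsRegressor()),
])

# Inner CV: tune K by RMSE.
param_grid = {"knn__n_neighbors": list(range(1, 21))}
inner = GridSearchCV(pipe, param_grid, cv=KFold(5, shuffle=True, random_state=1),
                     scoring="neg_root_mean_squared_error")

# Outer CV: unbiased estimate of the whole tuned pipeline's RMSE.
outer_scores = cross_val_score(inner, X, y, cv=KFold(5, shuffle=True, random_state=2),
                               scoring="neg_root_mean_squared_error")
print(-outer_scores.mean())
```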

Doing hyperparameter estimation for the estimator in each fold of Recursive Feature Elimination

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-05 10:39:16
I am using sklearn to carry out recursive feature elimination with cross-validation, using the RFECV module. RFE involves training an estimator on the full set of features and then repeatedly removing the least informative features, until it converges on the optimal number of features. In order to obtain optimal performance from the estimator, I want to select the best hyperparameters for the estimator for each number of features (edited for clarity). The estimator is a linear SVM, so I am only looking into the C parameter. Initially, my code was as follows. However, this just did one grid search for …
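One commonly suggested pattern, sketched here under the assumption of a linear SVM and a small C grid (the GridSearchedSVC wrapper is hypothetical, not part of scikit-learn): give RFECV an estimator whose own fit runs a grid search and then exposes the winning model's coef_, so that C is re-tuned every time the feature set shrinks.

```python
from sklearn.base import BaseEstimator
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

class GridSearchedSVC(BaseEstimator):
    """Hypothetical wrapper: tunes C on every fit and exposes coef_ for RFE's ranking."""
    def __init__(self, Cs=(0.01, 0.1, 1.0, 10.0), cv=3):
        self.Cs = Cs
        self.cv = cv

    def fit(self, X, y):
        grid = GridSearchCV(LinearSVC(max_iter=5000), {"C": list(self.Cs)}, cv=self.cv)
        grid.fit(X, y)
        self.best_estimator_ = grid.best_estimator_
        self.coef_ = self.best_estimator_.coef_      # RFE drops the smallest |coef_| each step
        self.classes_ = self.best_estimator_.classes_
        return self

    def predict(self, X):
        return self.best_estimator_.predict(X)

    def score(self, X, y):
        return self.best_estimator_.score(X, y)

X, y = make_classification(n_samples=200, n_features=25, n_informative=5, random_state=0)
selector = RFECV(GridSearchedSVC(), step=1, cv=5)
selector.fit(X, y)
print(selector.n_features_, selector.support_)
```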

Feature selection + cross-validation, but how to make ROC-curves in R

Submitted by 戏子无情 on 2019-12-05 09:59:45
Question: I'm stuck with the following problem. I divide my data into 10 folds. Each time, I use 1 fold as the test set and the other 9 as the training set (I do this ten times). On each training set, I do feature selection (a filter method with chi.squared) and then I build an SVM model with my training set and the selected features. So at the end I end up with 10 different models (because of the feature selection). But now I want to make a ROC curve in R for this filter method in general. How can I do this? Silke
Answer 1: …