Cross-validation

How to extract important features after k-fold cross validation, with or without a pipeline?

Submitted by 爷,独闯天下 on 2019-12-11 06:29:00
Question: I want to build a classifier that uses cross-validation and then extract the important features (coefficients) from each fold so I can look at their stability. At the moment I am using cross_validate and a pipeline. I want to use a pipeline so that I can do feature selection and standardization within each fold. I'm stuck on how to extract the features from each fold. I list an alternative to using a pipeline below, in case the pipeline is the problem. This is my code so far (I want to try SVM and …
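A minimal sketch of one way to do this (the pipeline steps and names here are illustrative assumptions, not the asker's code): `cross_validate(..., return_estimator=True)` keeps the fitted pipeline from every fold, so the per-fold coefficients can be collected and compared for stability.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),             # standardization happens inside each fold
    ("clf", LinearSVC(C=1.0, max_iter=5000)),
])

# return_estimator=True hands back the fitted pipeline from each fold
cv_results = cross_validate(pipe, X, y, cv=5, return_estimator=True)

# One coefficient vector per fold, read from the fitted classifier step
fold_coefs = [est.named_steps["clf"].coef_.ravel()
              for est in cv_results["estimator"]]
```

`return_estimator` requires scikit-learn 0.20 or later; with feature selection in the pipeline, the surviving features per fold can be read the same way from that step's `get_support()`.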

Loop to implement Leave-One-Out observation and run glm, one variable at a time

Submitted by 你离开我真会死。 on 2019-12-11 05:50:11
Question: I have a data frame with 96 observations and 1106 variables. I would like to run logistic regression on the observations, leaving one out at a time. (So the first set would contain 95 observations with the first observation removed, the second set would contain 95 observations with the second observation removed, and so forth, so that there are 96 sets of observations that each have one observation left out.) In addition, I would like to run each set of these …
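The question is about R's `glm`, so this is only a hedged Python analogue of the same idea (all sizes and names here are stand-ins): leave one observation out at a time with `LeaveOneOut`, and fit a logistic regression on one predictor at a time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # stand-in for the 96 x 1106 data frame
y = np.tile([0, 1], 10)             # deterministic binary outcome

loo = LeaveOneOut()
preds = np.zeros((X.shape[0], X.shape[1]))   # one column of held-out predictions per variable

for j in range(X.shape[1]):         # one variable at a time
    Xj = X[:, [j]]
    for train_idx, test_idx in loo.split(Xj):
        model = LogisticRegression().fit(Xj[train_idx], y[train_idx])
        preds[test_idx, j] = model.predict(Xj[test_idx])
```

With n observations, `LeaveOneOut` yields exactly n train/test splits, each training on n-1 rows, which mirrors the 96 sets described above.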

Creating folds manually for K-fold cross-validation R

Submitted by 痴心易碎 on 2019-12-11 05:28:10
Question: I am trying to build a K-fold CV regression model with K = 5. I tried the cv.glm function from the "boot" package, but my PC ran out of memory because the boot package always computes a LOOCV MSE alongside it. So I decided to do it manually, but I ran into the following problem: I try to divide my data frame into 5 vectors of equal length, each containing a sample of 1/5 of the row numbers of my df, but I get unexplainable lengths from the 3rd fold onward. a <- sample((d <- 1:1000), size = 100, replace = FALSE) b <- …
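The usual pitfall with repeated `sample()` calls is that the folds overlap or leave rows uncovered. A minimal Python sketch of the fix (the question is in R; this is only the analogous idea): shuffle the row indices once and slice the single permutation into K non-overlapping chunks.

```python
import numpy as np

n_rows, k = 1000, 5
rng = np.random.default_rng(42)

idx = rng.permutation(n_rows)       # every row number appears exactly once
folds = np.array_split(idx, k)      # 5 disjoint folds of 200 indices each
```

In R the same one-permutation trick is `split(sample(1:1000), rep(1:5, each = 200))`; either way, sampling once guarantees the folds partition the rows.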

scikit-learn LogisticRegressionCV: best coefficients

Submitted by 情到浓时终转凉″ on 2019-12-11 05:09:40
Question: I am trying to understand how the best coefficients are calculated in a logistic regression cross-validation (LogisticRegressionCV) when the "refit" parameter is True. If I understand the docs correctly, the best coefficients are the result of first determining the best regularization parameter "C", i.e. the value of C that has the highest average score over all folds. Then the best coefficients are simply the coefficients that were calculated on the fold that has the highest score for the best C. I assume that …
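A short sketch of where to look (toy data; the `Cs` grid is an illustrative assumption): `C_` holds the winning regularization strength per class, and with `refit=True` the scikit-learn docs describe `coef_` as coming from a final refit on the full dataset at that C, not from any single fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegressionCV(Cs=[0.01, 0.1, 1.0, 10.0],
                           cv=5, refit=True).fit(X, y)

best_C = clf.C_[0]      # chosen C (one entry per class; binary gives one)
coefs = clf.coef_       # coefficients after the final refit at best_C
```

Comparing `coefs` against a plain `LogisticRegression(C=best_C).fit(X, y)` is a quick way to check the refit behaviour empirically.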

Getting features in RFECV scikit-learn

Submitted by 百般思念 on 2019-12-11 01:56:10
Question: Inspired by this: http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py I am wondering if there is any way to get the features for a particular score. In that case, I would like to know which 10 selected features give the peak when #Features = 10. Any ideas? EDIT: This is the code used to get that plot: from sklearn.feature_selection import RFECV from sklearn.model…
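A hedged sketch (toy data, not the plot's dataset): after `RFECV` fits, `support_` is a boolean mask over the input columns marking the features kept at the automatically chosen size, and `ranking_` orders the eliminated ones.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)

rfecv = RFECV(estimator=SVC(kernel="linear"), step=1,
              cv=StratifiedKFold(5)).fit(X, y)

# Column indices of the features surviving at the chosen size
selected = [i for i, keep in enumerate(rfecv.support_) if keep]
```

Note that `support_` only reports the features at `n_features_`, RFECV's own optimum; to read off the subset at some other point on the curve (e.g. exactly 10 features), re-run plain `RFE(estimator, n_features_to_select=10)` on the same data.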

sklearn TimeSeriesSplit cross_val_predict only works for partitions

Submitted by 自古美人都是妖i on 2019-12-11 01:41:25
Question: I am trying to use the TimeSeriesSplit cross-validation strategy in sklearn version 0.18.1 with a LogisticRegression estimator. I get an error stating: cross_val_predict only works for partitions. The following code snippet shows how to reproduce it: from sklearn import linear_model, neighbors from sklearn.model_selection import train_test_split, cross_val_predict, TimeSeriesSplit, KFold, cross_val_score import pandas as pd import numpy as np from datetime import date, datetime df = pd…
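The error is expected: `cross_val_predict` requires every sample to land in exactly one test fold, and `TimeSeriesSplit` never tests the earliest samples, so its folds are not a partition. A sketch of a hand-rolled workaround (toy data assumed) that collects the out-of-sample predictions the splitter does produce:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.tile([0, 1], 50)             # deterministic labels, both classes throughout

tscv = TimeSeriesSplit(n_splits=5)
preds, test_rows = [], []
for train_idx, test_idx in tscv.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    preds.append(model.predict(X[test_idx]))
    test_rows.append(test_idx)      # remember which rows each prediction covers

preds = np.concatenate(preds)
test_rows = np.concatenate(test_rows)
```

`test_rows` maps each prediction back to its original row; the first training block (here 20 rows) never gets a prediction, which is exactly why the partition check fails.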

Predicted values of each fold in K-Fold Cross Validation in sklearn

Submitted by 喜你入骨 on 2019-12-10 18:26:34
Question: I have performed 10-fold cross-validation on a dataset using python sklearn: result = cross_val_score(best_svr, X, y, cv=10, scoring='r2') print(result.mean()) I have been able to get the mean r2 score as the final result. I want to know if there is a way to print out the predicted values for each fold (in this case, 10 sets of values). Answer 1: I believe you are looking for the cross_val_predict function. Answer 2: To print the predictions for each fold: for k in range(2, 10): …
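A sketch of Answer 1 (with a stand-in regressor, since `best_svr` is the asker's own model): `cross_val_predict` returns one out-of-fold prediction per sample in the original row order, and reusing the same splitter recovers the fold groupings.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.svm import SVR

X, y = make_regression(n_samples=100, n_features=4, random_state=0)

cv = KFold(n_splits=10)
y_pred = cross_val_predict(SVR(), X, y, cv=cv)   # one prediction per row

# To see which predictions belong to which fold, iterate the same splits:
per_fold = [y_pred[test_idx] for _, test_idx in cv.split(X)]
```

Passing the same `cv` object to both calls is what keeps the fold assignments consistent between the predictions and the grouping step.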

How to get classes labels from cross_val_predict used with predict_proba in scikit-learn

Submitted by 丶灬走出姿态 on 2019-12-10 15:02:25
Question: I need to train a Random Forest classifier using 3-fold cross-validation. For each sample, I need to retrieve the prediction probability when it happens to be in the test set. I am using scikit-learn version 0.18.dev0. This new version adds the ability to call cross_val_predict() with an additional method parameter to choose which kind of prediction to request from the estimator. In my case I want to use the predict_proba() method, which returns the probability for each class, in a …
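A sketch of the label-ordering answer (toy data assumed): the columns returned by `method="predict_proba"` follow the sorted class labels, i.e. `np.unique(y)`, which is the same ordering a fitted estimator exposes as `classes_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=150, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)

proba = cross_val_predict(RandomForestClassifier(random_state=0), X, y,
                          cv=3, method="predict_proba")

class_labels = np.unique(y)    # column order of `proba`
```

Each row of `proba` is that sample's out-of-fold class-probability vector, so `class_labels[proba.argmax(axis=1)]` recovers the hard predictions.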

How to speed up nested cross validation in python?

Submitted by 放肆的年华 on 2019-12-10 13:34:56
Question: From what I've found there is one other question like this (Speed-up nested cross-validation); however, installing MPI does not work for me after trying several fixes also suggested on this site and Microsoft's, so I am hoping there is another package or an answer to this question. I am looking to compare multiple algorithms and grid-search a wide range of parameters (maybe too many parameters?). What ways are there, besides mpi4py, to speed up running my code? As I understand it I cannot use n…
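One MPI-free option, sketched with an illustrative estimator and grid: scikit-learn parallelizes through joblib via `n_jobs`, so nested CV can be sped up on a single machine by setting `n_jobs=-1` on the inner `GridSearchCV` (setting it on both the inner and outer loop at once tends to oversubscribe the cores).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: parallel grid search over an illustrative parameter grid
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3, n_jobs=-1)

# Outer loop stays serial; each outer fold runs its own parallel search
scores = cross_val_score(inner, X, y, cv=5)
```

Trimming the grid (or switching to `RandomizedSearchCV`) usually buys more than parallelism when the parameter space is very large.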

why is sklearn.feature_selection.RFECV giving different results for each run

Submitted by 吃可爱长大的小学妹 on 2019-12-10 11:55:47
Question: I tried to do feature selection with RFECV, but it gives different results each time. Does cross-validation divide the sample X into random chunks or into sequential deterministic chunks? Also, why is the score different for grid_scores_ and score(X, y)? And why are the scores sometimes negative? Answer 1: Does cross-validation divide the sample X into random chunks or into sequential deterministic chunks? CV divides the data into deterministic chunks by default. You can change this behaviour …
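Even with deterministic folds, a stochastic estimator (like a random forest) can make RFECV pick different features across runs. A sketch of pinning every source of randomness (toy data; `run` is an illustrative helper) so two runs agree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=100, n_features=12, random_state=0)

def run():
    # random_state is fixed on BOTH the estimator and the splitter
    return RFECV(RandomForestClassifier(n_estimators=20, random_state=0),
                 cv=StratifiedKFold(5, shuffle=True, random_state=0)).fit(X, y)

same = (run().support_ == run().support_).all()   # identical selections
```

On the negative scores: for regressors the default metric is R², which goes below zero whenever the model predicts worse than the mean of y, so negative fold scores are legitimate, not a bug.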