cross-validation

Cross Validation metrics with PySpark

谁说胖子不能爱 Submitted on 2019-12-06 09:17:32
When we do a k-fold cross validation we are testing how well a model behaves when it comes to predicting data it has never seen. If I split my dataset into 90% training and 10% test and analyse the model's performance, there is no guarantee that my test set doesn't contain only the 10% "easiest" or "hardest" points to predict. By doing a 10-fold cross validation I can be assured that every point will be used at least once for training. As (in this case) the model will be tested 10 times, we can do an analysis of those test metrics, which will provide us with a better understanding of how the model is…
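A minimal sketch of the kind of setup the question describes, not the asker's code: 10-fold cross-validation with PySpark ML on a tiny generated DataFrame (the data, the logistic-regression estimator and the parameter grid are all placeholder assumptions); avgMetrics reports the evaluator's metric averaged over the folds, one value per parameter combination.

# Minimal sketch (assumed setup, not the asker's code): 10-fold CV with PySpark ML.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.getOrCreate()

# Toy training frame standing in for real data: a 'features' vector and a binary 'label'
train_df = spark.createDataFrame(
    [(Vectors.dense([float(i), float(i % 2)]), float(i % 2)) for i in range(100)],
    ["features", "label"])

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=10)
cv_model = cv.fit(train_df)

# One averaged AUC per parameter combination (mean over the 10 test folds)
print(cv_model.avgMetrics)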

Using cross validation and AUC-ROC for a logistic regression model in sklearn

亡梦爱人 Submitted on 2019-12-06 05:17:48
Question: I'm using the sklearn package to build a logistic regression model and then evaluate it. Specifically, I want to do so using cross validation, but can't figure out the right way to do it with the cross_val_score function. According to the documentation and some examples I saw, I need to pass the function the model, the features, the outcome, and a scoring method. However, AUC doesn't need predictions, it needs probabilities, so it can try different threshold values and calculate the ROC…
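A minimal sketch of what the question is after, assuming synthetic placeholder data: passing scoring='roc_auc' makes cross_val_score use the classifier's predicted probabilities (or decision function) internally, so no manual thresholding is needed.

# Sketch with placeholder data: 10-fold CV reporting AUC-ROC per fold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 5)                 # placeholder features
y = np.random.randint(0, 2, size=200)      # placeholder binary labels

model = LogisticRegression(max_iter=1000)
auc_scores = cross_val_score(model, X, y, cv=10, scoring='roc_auc')
print(auc_scores.mean(), auc_scores.std())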

caret: combine createResample and groupKFold

主宰稳场 Submitted on 2019-12-06 04:20:44
I want to do custom sampling with caret. My specifications are the following: I have 1 observation per day, and my grouping factor is the month (12 values); so in the first step I create 12 resamples with 11 months in the training set (11*30 points) and 1 month in the test set (30 points). That way I get 12 resamples in total. But that's not enough for me, and I would like to make it a little more complex by adding some bootstrapping of the training points of each partition. So, instead of having 11*30 points in Resample01, I would have several bootstrapped resamples of those 330 points. So in the end,…
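The question is about caret in R; purely as an illustration of the underlying idea (leave one month out, then bootstrap the training indices of each partition), here is a sketch in Python/scikit-learn with made-up toy data. It is not a caret answer.

# Illustration only: leave-one-month-out folds, each with bootstrapped training indices.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n_days, n_boot = 360, 5                      # toy sizes: 12 months x 30 days
X = rng.normal(size=(n_days, 3))             # placeholder features
months = np.repeat(np.arange(12), 30)        # grouping factor: month of each day

for train_idx, test_idx in LeaveOneGroupOut().split(X, groups=months):
    for b in range(n_boot):
        boot_idx = rng.choice(train_idx, size=len(train_idx), replace=True)
        # fit a model on X[boot_idx] and evaluate it on X[test_idx] here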

How to plot a learning curve for a keras experiment?

匆匆过客 Submitted on 2019-12-05 17:59:32
Question: I'm training an RNN using Keras and would like to see how the validation accuracy changes with the data set size. Keras has a list called val_acc in its history object, which gets appended to after every epoch with the respective validation set accuracy (link to the post in the Google group). I want to get the average of val_acc over the number of epochs run and plot that against the respective data set size. Question: How can I retrieve the elements in the val_acc list and perform an operation like…
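A self-contained sketch, with a toy RNN and random data standing in for the asker's setup: the History object returned by model.fit() holds the per-epoch validation accuracy (under 'val_accuracy' in tf.keras 2.x, 'val_acc' in older Keras), and averaging it gives one point of the learning curve for a given training-set size.

import numpy as np
from tensorflow import keras

# Toy stand-ins for the real RNN and data set
X_train = np.random.rand(500, 10, 1)
y_train = np.random.randint(0, 2, size=500)

model = keras.Sequential([
    keras.layers.SimpleRNN(8, input_shape=(10, 1)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, validation_split=0.2,
                    epochs=5, batch_size=32, verbose=0)

# Key is 'val_accuracy' in tf.keras 2.x, 'val_acc' in older standalone Keras
val_acc = history.history.get('val_accuracy', history.history.get('val_acc'))
print(len(X_train), float(np.mean(val_acc)))   # one learning-curve point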

Feature selection + cross-validation, but how to make ROC-curves in R

戏子无情 Submitted on 2019-12-05 09:59:45
Question: I'm stuck with the following problem. I divide my data into 10 folds. Each time, I use 1 fold as the test set and the other 9 as the training set (I do this ten times). On each training set, I do feature selection (a filter method with chi.squared) and then I build an SVM model with my training set and the selected features. So at the end I get 10 different models (because of the feature selection). But now I want to make a ROC curve in R for this filter method in general. How can I do this? Silke Answer 1:
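The question concerns R (FSelector's chi.squared plus an SVM); as an illustration of one common recipe only — select features inside each fold, collect the held-out scores, and pool them into a single ROC curve — here is a sketch in Python with synthetic data. It is not necessarily the answer given on the original thread.

# Illustration of pooled out-of-fold ROC with per-fold feature selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X = X - X.min(axis=0)                      # chi2 requires non-negative features

oof_scores, oof_truth = [], []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    selector = SelectKBest(chi2, k=5).fit(X[tr], y[tr])        # selection per fold
    clf = SVC(probability=True).fit(selector.transform(X[tr]), y[tr])
    oof_scores.extend(clf.predict_proba(selector.transform(X[te]))[:, 1])
    oof_truth.extend(y[te])

fpr, tpr, _ = roc_curve(oof_truth, oof_scores)     # one ROC from all held-out scores
print(auc(fpr, tpr))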

Randomized stratified k-fold cross-validation in scikit-learn?

陌路散爱 Submitted on 2019-12-05 07:44:07
Is there any built-in way to get scikit-learn to perform shuffled, stratified k-fold cross-validation? This is one of the most common CV methods, and I am surprised I couldn't find a built-in method to do this. I saw that cross_validation.KFold() has a shuffling flag, but it is not stratified. Unfortunately cross_validation.StratifiedKFold() does not have such an option, and cross_validation.StratifiedShuffleSplit() does not produce disjoint folds. Am I missing something? Is this planned? (Obviously I can implement this myself.) The shuffling flag for cross_validation.StratifiedKFold has been…
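For reference, in current scikit-learn (the model_selection module that replaced cross_validation), StratifiedKFold does accept a shuffle flag and yields disjoint, stratified folds; a minimal sketch with placeholder data:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(100, 4)                 # placeholder features
y = np.random.randint(0, 2, size=100)      # placeholder binary labels

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    pass  # train/evaluate here; each test fold is disjoint and class-balanced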

Classification table for logistic regression in R

给你一囗甜甜゛ Submitted on 2019-12-05 06:25:59
I have a data set consisting of a dichotomous dependent variable ( Y ) and 12 independent variables ( X1 to X12 ) stored in a csv file. Here are the first 5 rows of the data:
Y,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12
0,9,3.86,111,126,14,13,1,7,7,0,M,46-50
1,7074,3.88,232,4654,143,349,2,27,18,6,M,25-30
1,5120,27.45,97,2924,298,324,3,56,21,0,M,31-35
1,18656,79.32,408,1648,303,8730,286,294,62,28,M,25-30
0,3869,21.23,260,2164,550,320,3,42,203,3,F,18-24
I constructed a logistic…
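The question is about R (a glm plus a classification table); as a sketch of the same idea only, here is the Python equivalent on synthetic data: threshold the predicted probabilities at 0.5 and cross-tabulate predicted against observed classes.

# Illustration only: classification table (confusion matrix) at a 0.5 cutoff.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                                   # synthetic predictors
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int) # synthetic outcome

prob = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
pred = (prob > 0.5).astype(int)
print(pd.crosstab(pd.Series(y, name='observed'), pd.Series(pred, name='predicted')))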

Scikit F-score metric error

故事扮演 Submitted on 2019-12-05 05:32:40
I am trying to predict a set of labels using logistic regression from scikit-learn. My data is really imbalanced (there are many more '0' than '1' labels), so I have to use the F1 score metric during the cross-validation step to "balance" the result.
[Input]
X_training, y_training, X_test, y_test = generate_datasets(df_X, df_y, 0.6)
logistic = LogisticRegressionCV(Cs=50, cv=4, penalty='l2', fit_intercept=True, scoring='f1')
logistic.fit(X_training, y_training)
print('Predicted: %s' % str(logistic.predict(X_test)))
print('F1-score: %f' % f1_score(y_test, logistic.predict(X_test)))
print('Accuracy…
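A self-contained version of the same pattern, with make_classification standing in for the asker's generate_datasets helper (which is not shown): optimise F1 during cross-validation on an imbalanced problem, then report F1 and accuracy on the held-out test set.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 90% of one class, 10% of the other
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, stratify=y,
                                          random_state=0)

logistic = LogisticRegressionCV(Cs=50, cv=4, penalty='l2', scoring='f1',
                                max_iter=1000)
logistic.fit(X_tr, y_tr)
pred = logistic.predict(X_te)
print('F1-score: %f' % f1_score(y_te, pred))
print('Accuracy: %f' % accuracy_score(y_te, pred))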

keras/scikit-learn: using fit_generator() with cross validation

我怕爱的太早我们不能终老 Submitted on 2019-12-05 02:23:28
Question: Is it possible to use Keras's scikit-learn API together with the fit_generator() method? Or is there another way to yield batches for training? I'm using SciPy sparse matrices, which must be converted to NumPy arrays before being fed to Keras, but I can't convert them all at once because of high memory consumption. Here is my function to yield batches:
def batch_generator(X, y, batch_size):
    n_splits = len(X) // (batch_size - 1)
    X = np.array_split(X, n_splits)
    y = np.array_split(y, n_splits)
    while…
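One way to sidestep the scikit-learn wrapper entirely — sketched here under the assumption of recent tf.keras, where model.fit() accepts a generator and fit_generator() is deprecated — is a manual KFold loop with a generator that densifies only one sparse batch at a time; the tiny build_model() below is a placeholder, not the asker's network.

import numpy as np
from scipy import sparse
from sklearn.model_selection import KFold
from tensorflow import keras

def build_model():
    # Placeholder network; the real model would go here
    m = keras.Sequential([
        keras.layers.Dense(16, activation='relu', input_shape=(500,)),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    m.compile(optimizer='adam', loss='binary_crossentropy')
    return m

def batch_generator(X, y, batch_size):
    while True:                                   # Keras expects an endless generator
        idx = np.random.permutation(X.shape[0])
        for start in range(0, X.shape[0], batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch].toarray(), y[batch]    # densify only this batch

X = sparse.random(1000, 500, density=0.01, format='csr')   # toy sparse features
y = np.random.randint(0, 2, size=1000)

for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = build_model()
    model.fit(batch_generator(X[train_idx], y[train_idx], 32),
              steps_per_epoch=len(train_idx) // 32, epochs=3, verbose=0,
              validation_data=batch_generator(X[val_idx], y[val_idx], 32),
              validation_steps=len(val_idx) // 32)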

Reproducible splitting of data into training and testing in R

本秂侑毒 Submitted on 2019-12-04 22:26:00
A common way of sampling/splitting data in R is using sample, e.g., on row numbers. For example:
require(data.table)
set.seed(1)
population <- as.character(1e5:(1e6-1)) # some made-up ID names
N <- 1e4 # sample size
sample1 <- data.table(id = sort(sample(population, N))) # randomly sample N ids
test <- sample(N-1, N/2, replace = F)
test1 <- sample1[test, .(id)]
The problem is that this isn't very robust to changes in the data. For example, if we drop just one observation:
sample2 <- sample1[-sample(N, 1)]
samples 1 and 2 are still all but identical:
nrow(merge(sample1, sample2))
[1] 9999
Yet…
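The question is about R; as an illustration of one common remedy (not necessarily the accepted answer on the thread), here is a Python sketch that assigns each ID deterministically by hashing the ID itself, so an observation's train/test membership never depends on which other rows happen to be present.

# Illustration: hash-based, order-independent train/test assignment per ID.
import hashlib

def in_test_set(identifier, test_ratio=0.5):
    # Same ID always hashes to the same bucket, regardless of the rest of the data
    h = hashlib.md5(str(identifier).encode()).hexdigest()
    return int(h, 16) % 100 < test_ratio * 100

ids = [str(i) for i in range(100000, 110000)]          # made-up ID names
test_ids = [i for i in ids if in_test_set(i)]
train_ids = [i for i in ids if not in_test_set(i)]
print(len(test_ids), len(train_ids))                   # roughly a 50/50 split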