cross-validation

Nested cross validation with StratifiedShuffleSplit in sklearn

Submitted by 和自甴很熟 on 2020-01-01 07:26:23
Question: I am working on a binary classification problem and would like to perform nested cross-validation to assess the classification error. The reason I'm doing the nested CV is the small sample size (N_0 = 20, N_1 = 10), where N_0 and N_1 are the numbers of instances in the 0 and 1 classes respectively. My code is quite simple:

>> pipe_logistic = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression(penalty='l1'))])
>> parameters = {'clf__C': logspace(-4,1,50)}
>> grid_search =
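The snippet is truncated at grid_search =; below is a minimal sketch of how such a nested CV is typically completed, assuming a modern sklearn (where the l1 penalty needs solver='liblinear') and synthetic data standing in for the poster's 30 samples:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the poster's 30-sample, 2:1 imbalanced data
X, y = make_classification(n_samples=30, weights=[2 / 3], random_state=0)

pipe_logistic = Pipeline([('scl', StandardScaler()),
                          ('clf', LogisticRegression(penalty='l1', solver='liblinear'))])
parameters = {'clf__C': np.logspace(-4, 1, 50)}

# Inner loop tunes C; outer loop gives an unbiased error estimate.
# Stratified splitters preserve the 2:1 class ratio in every small fold.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

grid_search = GridSearchCV(pipe_logistic, parameters, cv=inner_cv)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv)
print(nested_scores.mean())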

Calculate cross validation for Generalized Linear Model in Matlab

Submitted by 妖精的绣舞 on 2020-01-01 06:48:25
Question: I am doing a regression using a Generalized Linear Model. I was caught off guard by the crossval function. My implementation so far:

x = 'Some dataset, containing the input and the output'
X = x(:,1:7);
Y = x(:,8);
cvpart = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cvpart),:);
Ytrain = Y(training(cvpart),:);
Xtest = X(test(cvpart),:);
Ytest = Y(test(cvpart),:);
mdl = GeneralizedLinearModel.fit(Xtrain,Ytrain,'linear','distr','poisson');
Ypred = predict(mdl,Xtest);
res = (Ypred - Ytest);

What exactly does KFold in Python do?

Submitted by 邮差的信 on 2020-01-01 04:39:09
Question: I am looking at this tutorial: https://www.dataquest.io/mission/74/getting-started-with-kaggle I got to part 9, making predictions. There, some data in a dataframe called titanic is divided into folds using:

# Generate cross-validation folds for the titanic dataset. It returns the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

I am not
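For reference, a minimal sketch of what those folds contain, written against the modern model_selection API (the tutorial's cross_validation.KFold(n, n_folds, ...) is the older spelling, and the new API requires shuffle=True when random_state is set):

import numpy as np
from sklearn.model_selection import KFold

n_rows = 6  # stands in for titanic.shape[0]
kf = KFold(n_splits=3, shuffle=True, random_state=1)
for train_index, test_index in kf.split(np.arange(n_rows)):
    # Each fold is a pair of integer row indices into the dataframe,
    # usable as titanic.iloc[train_index] / titanic.iloc[test_index].
    print("TRAIN:", train_index, "TEST:", test_index)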

Grid Search and Early Stopping Using Cross-Validation with XGBoost in scikit-learn

Submitted by 牧云@^-^@ on 2019-12-31 21:43:10
Question: I am fairly new to scikit-learn and have been trying to hyperparameter-tune XGBoost. My aim is to use grid search to tune the model parameters and early stopping to control the number of trees and avoid overfitting. As I am using cross-validation for the grid search, I was hoping to also use cross-validation in the early-stopping criterion. The code I have so far looks like this:

import numpy as np
import pandas as pd
from sklearn import model_selection
import xgboost
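The snippet is cut off above; what follows is a hedged sketch of one common compromise, assuming an older xgboost where early_stopping_rounds is a fit() argument (it was removed in 2.0 in favor of a constructor parameter). GridSearchCV cannot carve out a fresh eval_set per fold, so a fixed validation split is held out for the stopping criterion:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)  # synthetic stand-in
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(n_estimators=1000)  # large cap; early stopping trims it
param_grid = {'max_depth': [3, 5], 'learning_rate': [0.05, 0.1]}

search = GridSearchCV(model, param_grid, cv=3)
# fit keyword arguments are forwarded to XGBClassifier.fit for every fold
search.fit(X_tr, y_tr,
           early_stopping_rounds=10,
           eval_set=[(X_val, y_val)],
           verbose=False)
print(search.best_params_)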

sklearn KFold: access a single fold instead of a for loop

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-31 20:54:08
Question: After using cross_validation.KFold(n, n_folds=folds), I would like to access the indices for training and testing of a single fold, instead of going through all the folds. So let's take the example code:

from sklearn import cross_validation
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = cross_validation.KFold(4, n_folds=2)

>>> print(kf)
sklearn.cross_validation.KFold(n=4, n_folds=2, shuffle=False, random_state=None)
>>> for train_index, test_index in kf:

I would
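One answer-style sketch: materialize the fold generator with list() and index the fold you want. This uses the modern model_selection API, where the index pairs come from kf.split(X) rather than from iterating the KFold object itself:

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

kf = KFold(n_splits=2)
folds = list(kf.split(X))           # [(train_idx, test_idx), ...]
train_index, test_index = folds[0]  # grab one fold, no loop needed
print("TRAIN:", train_index, "TEST:", test_index)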

Difference between using train_test_split and cross_val_score in sklearn.cross_validation

Submitted by 六月ゝ 毕业季﹏ on 2019-12-31 13:47:27
Question: I have a matrix with 20 columns. The last column contains 0/1 labels. The link to the data is here. I am trying to run a random forest on the dataset, using cross-validation. I use two methods of doing this: sklearn.cross_validation.cross_val_score and sklearn.cross_validation.train_test_split. I am getting different results when I do what I think is pretty much the same exact thing. To exemplify, I run a two-fold cross-validation using the two methods above, as in the code below.

import csv
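The code is cut off at import csv; here is a sketch of the two approaches side by side, with synthetic data standing in for the linked file. The usual cause of the discrepancy is that cross_val_score with an integer cv uses unshuffled (stratified) folds while train_test_split shuffles, so the two methods train on different rows; an unseeded forest adds further variation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=200, n_features=19, random_state=0)
rf = RandomForestClassifier(random_state=0)

# Method 1: two-fold cross-validation (unshuffled, stratified folds)
print(cross_val_score(rf, X, y, cv=2).mean())

# Method 2: a single shuffled 50/50 split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
print(rf.fit(X_tr, y_tr).score(X_te, y_te))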

caret: coefficients of the cross-validated sets

Submitted by 末鹿安然 on 2019-12-25 01:56:01
Question: Is it possible to get the coefficients of all the cross-validation sets from the R caret package?

set.seed(1)
mu <- rep(0, 4)
Sigma <- matrix(.7, nrow=4, ncol=4)
diag(Sigma) <- 1
rawvars <- mvrnorm(n=1000, mu=mu, Sigma=Sigma)
d <- as.ordered( as.numeric(rawvars[,1]>0.5) )
d[1:200] <- 1
df <- data.frame(rawvars, d)
ind <- sample(1:nrow(df), 500)
train <- df[ind,]
test <- df[-ind,]
trControl <- trainControl(method = "repeatedcv", repeats = 1, classProb = T, summaryFunction = twoClassSummary)
fit

Implementing cross-validation in Java

Submitted by 假如想象 on 2019-12-24 20:35:22
Question: I use Spring Roo + JPA + Hibernate and I would like to implement cross-validation (validation of several fields at the same time) in my application. I am not sure how to go about implementing it. Can anyone please advise me and/or direct me to relevant documentation?

Answer 1: Have a look at Hibernate Validator, which allows entity validation (using annotations). http://www.hibernate.org/subprojects/validator.html In short, you annotate your field constraints by placing Hibernate Validator/JPA

Custom Evaluator during cross-validation in Spark

Submitted by 蹲街弑〆低调 on 2019-12-24 09:47:40
Question: My aim is to add a rank-based evaluator to the CrossValidator function (PySpark):

cvExplicit = CrossValidator(estimator=cvSet, numFolds=8, estimatorParamMaps=paramMap, evaluator=rnkEvaluate)

I need to pass the evaluated dataframe into the function, but I do not know how to do that part.

class rnkEvaluate():
    def __init__(self, user_col="user", rating_col="rating", prediction_col="prediction"):
        print(user_col)
        print(rating_col)
        print(prediction_col)

    def isLargerBetter():
        return True
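On the "passing the dataframe" part: CrossValidator itself hands each fold's transformed predictions DataFrame to evaluator.evaluate(), so a custom evaluator only needs to subclass pyspark.ml.evaluation.Evaluator and implement _evaluate. A sketch along those lines (the metric body is a placeholder, not a real ranking metric):

from pyspark.ml.evaluation import Evaluator

class RnkEvaluator(Evaluator):
    def __init__(self, user_col="user", rating_col="rating",
                 prediction_col="prediction"):
        super(RnkEvaluator, self).__init__()
        self.user_col = user_col
        self.rating_col = rating_col
        self.prediction_col = prediction_col

    def _evaluate(self, dataset):
        # `dataset` is the predictions DataFrame that CrossValidator
        # produces for each fold; compute and return one float here.
        return float(dataset.count())  # placeholder for a rank metric

    def isLargerBetter(self):
        return True

# cvExplicit = CrossValidator(estimator=cvSet, numFolds=8,
#                             estimatorParamMaps=paramMap,
#                             evaluator=RnkEvaluator())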

How can I slice time-series data with multiple features to get a continuous plot containing [train + test + prediction]?

Submitted by 妖精的绣舞 on 2019-12-24 06:43:54
Question: I have a formatted dataset that looks like a matrix [N x M], where N = 40 is the total number of cycles (time stamps) and M = 1440 pixels. For every cycle, I have 1440 pixel values corresponding to the 1440 pixels. I've used different models to predict the pixel values of future cycles based on the past 10 cycles. The problem is that I couldn't achieve a proper continuous plot after training the NN, most probably due to the bad data-splitting technique I used via train_test_split, having never tried TimeSeriesSplit, as follows:
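The snippet ends before the TimeSeriesSplit code; here is a sketch of how it behaves on data of the described shape (synthetic values standing in for the 40 x 1440 matrix). Because each fold extends the training window forward in time with no shuffling, the train/test/prediction segments line up for a continuous plot:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.random.rand(40, 1440)  # 40 cycles x 1440 pixels (synthetic stand-in)

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    # each test window starts exactly where the training window ends
    print("train %d-%d, test %d-%d" % (train_index[0], train_index[-1],
                                       test_index[0], test_index[-1]))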