cross-validation

Nested cross validation with StratifiedShuffleSplit in sklearn

Submitted by 和自甴很熟 on 2020-01-01 07:26:23
Question: I am working on a binary classification problem and would like to perform nested cross-validation to assess the classification error. The reason I'm doing the nested CV is the small sample size (N_0 = 20, N_1 = 10), where N_0 and N_1 are the numbers of instances in the 0 and 1 classes respectively. My code is quite simple:

>> pipe_logistic = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression(penalty='l1'))])
>> parameters = {'clf__C': logspace(-4,1,50)}
>> grid_search =
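The snippet is truncated at grid_search =; below is a minimal sketch of how such a nested CV is typically completed, assuming a modern sklearn (where the l1 penalty needs solver='liblinear') and synthetic data standing in for the poster's 30 samples:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the poster's 30-sample, 2:1 imbalanced data
X, y = make_classification(n_samples=30, weights=[2 / 3], random_state=0)

pipe_logistic = Pipeline([('scl', StandardScaler()),
                          ('clf', LogisticRegression(penalty='l1', solver='liblinear'))])
parameters = {'clf__C': np.logspace(-4, 1, 50)}

# Inner loop tunes C; outer loop gives an unbiased error estimate.
# Stratified splitters preserve the 2:1 class ratio in every small fold.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

grid_search = GridSearchCV(pipe_logistic, parameters, cv=inner_cv)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv)
print(nested_scores.mean())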

Calculate cross validation for Generalized Linear Model in Matlab

Submitted by 妖精的绣舞 on 2020-01-01 06:48:25
Question: I am doing a regression using a Generalized Linear Model. I was caught off guard by the crossval function. My implementation so far:

x = 'Some dataset, containing the input and the output'
X = x(:,1:7);
Y = x(:,8);
cvpart = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cvpart),:);
Ytrain = Y(training(cvpart),:);
Xtest = X(test(cvpart),:);
Ytest = Y(test(cvpart),:);
mdl = GeneralizedLinearModel.fit(Xtrain,Ytrain,'linear','distr','poisson');
Ypred = predict(mdl,Xtest);
res = (Ypred - Ytest);

What exactly does KFold in Python do?

Submitted by 邮差的信 on 2020-01-01 04:39:09
Question: I am looking at this tutorial: https://www.dataquest.io/mission/74/getting-started-with-kaggle I got to part 9, making predictions. There, some data in a dataframe called titanic is divided into folds using:

# Generate cross-validation folds for the titanic dataset. It returns the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

I am not
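For reference, a minimal sketch of what those folds contain, written against the modern model_selection API (the tutorial's cross_validation.KFold(n, n_folds, ...) is the older spelling, and the new API requires shuffle=True when random_state is set):

import numpy as np
from sklearn.model_selection import KFold

n_rows = 6  # stands in for titanic.shape[0]
kf = KFold(n_splits=3, shuffle=True, random_state=1)
for train_index, test_index in kf.split(np.arange(n_rows)):
    # Each fold is a pair of integer row indices into the dataframe,
    # usable as titanic.iloc[train_index] / titanic.iloc[test_index].
    print("TRAIN:", train_index, "TEST:", test_index)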

Grid Search and Early Stopping Using Cross-Validation with XGBoost in scikit-learn

Submitted by 牧云@^-^@ on 2019-12-31 21:43:10
Question: I am fairly new to scikit-learn and have been trying to hyperparameter-tune XGBoost. My aim is to use grid search to tune the model parameters and early stopping to control the number of trees and avoid overfitting. As I am using cross-validation for the grid search, I was hoping to also use cross-validation in the early-stopping criterion. The code I have so far looks like this:

import numpy as np
import pandas as pd
from sklearn import model_selection
import xgboost
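The snippet is cut off above; what follows is a hedged sketch of one common compromise, assuming an older xgboost where early_stopping_rounds is a fit() argument (it was removed in 2.0 in favor of a constructor parameter). GridSearchCV cannot carve out a fresh eval_set per fold, so a fixed validation split is held out for the stopping criterion:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)  # synthetic stand-in
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(n_estimators=1000)  # large cap; early stopping trims it
param_grid = {'max_depth': [3, 5], 'learning_rate': [0.05, 0.1]}

search = GridSearchCV(model, param_grid, cv=3)
# fit keyword arguments are forwarded to XGBClassifier.fit for every fold
search.fit(X_tr, y_tr,
           early_stopping_rounds=10,
           eval_set=[(X_val, y_val)],
           verbose=False)
print(search.best_params_)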

sklearn KFold: access a single fold instead of a for loop

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-31 20:54:08
Question: After using cross_validation.KFold(n, n_folds=folds), I would like to access the indices for training and testing of a single fold, instead of going through all the folds. So let's take the example code:

from sklearn import cross_validation
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = cross_validation.KFold(4, n_folds=2)

>>> print(kf)
sklearn.cross_validation.KFold(n=4, n_folds=2, shuffle=False, random_state=None)
>>> for train_index, test_index in kf:

I would
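One answer-style sketch: materialize the fold generator with list() and index the fold you want. This uses the modern model_selection API, where the index pairs come from kf.split(X) rather than from iterating the KFold object itself:

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

kf = KFold(n_splits=2)
folds = list(kf.split(X))           # [(train_idx, test_idx), ...]
train_index, test_index = folds[0]  # grab one fold, no loop needed
print("TRAIN:", train_index, "TEST:", test_index)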

Difference between using train_test_split and cross_val_score in sklearn.cross_validation

Submitted by 六月ゝ 毕业季﹏ on 2019-12-31 13:47:27
Question: I have a matrix with 20 columns. The last column contains 0/1 labels. The link to the data is here. I am trying to run a random forest on the dataset, using cross-validation. I use two methods of doing this: sklearn.cross_validation.cross_val_score and sklearn.cross_validation.train_test_split. I am getting different results when I do what I think is pretty much the same exact thing. To exemplify, I run a two-fold cross-validation using the two methods above, as in the code below.

import csv
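The code is cut off at import csv; here is a sketch of the two approaches side by side, with synthetic data standing in for the linked file. The usual cause of the discrepancy is that cross_val_score with an integer cv uses unshuffled (stratified) folds while train_test_split shuffles, so the two methods train on different rows; an unseeded forest adds further variation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=200, n_features=19, random_state=0)
rf = RandomForestClassifier(random_state=0)

# Method 1: two-fold cross-validation (unshuffled, stratified folds)
print(cross_val_score(rf, X, y, cv=2).mean())

# Method 2: a single shuffled 50/50 split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
print(rf.fit(X_tr, y_tr).score(X_te, y_te))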

caret: coefficients of the cross-validated sets

Submitted by 末鹿安然 on 2019-12-25 01:56:01
Question: Is it possible to get the coefficients of all the cross-validation sets from the R caret package?

set.seed(1)
mu <- rep(0, 4)
Sigma <- matrix(.7, nrow=4, ncol=4)
diag(Sigma) <- 1
rawvars <- mvrnorm(n=1000, mu=mu, Sigma=Sigma)
d <- as.ordered( as.numeric(rawvars[,1]>0.5) )
d[1:200] <- 1
df <- data.frame(rawvars, d)
ind <- sample(1:nrow(df), 500)
train <- df[ind,]
test <- df[-ind,]
trControl <- trainControl(method = "repeatedcv", repeats = 1, classProb = T, summaryFunction = twoClassSummary)
fit

Implementing cross-validation in Java

Submitted by 假如想象 on 2019-12-24 20:35:22
Question: I use Spring Roo + JPA + Hibernate and I would like to implement cross-validation (validation of several fields at the same time) in my application. I am not sure how to go about implementing it. Can anyone please advise me and/or direct me to relevant documentation?

Answer 1: Have a look at Hibernate Validator, which allows entity validation (using annotations). http://www.hibernate.org/subprojects/validator.html In short, you annotate your field constraints by placing Hibernate Validator/JPA

Custom Evaluator during cross-validation in Spark

Submitted by 蹲街弑〆低调 on 2019-12-24 09:47:40
Question: My aim is to add a rank-based evaluator to the CrossValidator function (PySpark):

cvExplicit = CrossValidator(estimator=cvSet, numFolds=8, estimatorParamMaps=paramMap, evaluator=rnkEvaluate)

I need to pass the evaluated dataframe into the function, but I do not know how to do that part.

class rnkEvaluate():
    def __init__(self, user_col="user", rating_col="rating", prediction_col="prediction"):
        print(user_col)
        print(rating_col)
        print(prediction_col)

    def isLargerBetter():
        return True
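On the "passing the dataframe" part: CrossValidator itself hands each fold's transformed predictions DataFrame to evaluator.evaluate(), so a custom evaluator only needs to subclass pyspark.ml.evaluation.Evaluator and implement _evaluate. A sketch along those lines (the metric body is a placeholder, not a real ranking metric):

from pyspark.ml.evaluation import Evaluator

class RnkEvaluator(Evaluator):
    def __init__(self, user_col="user", rating_col="rating",
                 prediction_col="prediction"):
        super(RnkEvaluator, self).__init__()
        self.user_col = user_col
        self.rating_col = rating_col
        self.prediction_col = prediction_col

    def _evaluate(self, dataset):
        # `dataset` is the predictions DataFrame that CrossValidator
        # produces for each fold; compute and return one float here.
        return float(dataset.count())  # placeholder for a rank metric

    def isLargerBetter(self):
        return True

# cvExplicit = CrossValidator(estimator=cvSet, numFolds=8,
#                             estimatorParamMaps=paramMap,
#                             evaluator=RnkEvaluator())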

How can I slice time-series data with multiple features to get a continuous plot containing [train + test + prediction]?

Submitted by 妖精的绣舞 on 2019-12-24 06:43:54
Question: I have a formatted dataset that looks like a matrix [N x M], where N = 40 is the total number of cycles (time stamps) and M = 1440 pixels. For every cycle, I have 1440 pixel values corresponding to the 1440 pixels. I've used different models to predict the pixel values of future cycles based on the past 10 cycles. The problem is that I couldn't achieve a proper continuous plot after training the NN, most probably due to the bad data-splitting technique I used via train_test_split, having never tried TimeSeriesSplit, as follows:
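The snippet ends before the TimeSeriesSplit code; here is a sketch of how it behaves on data of the described shape (synthetic values standing in for the 40 x 1440 matrix). Because each fold extends the training window forward in time with no shuffling, the train/test/prediction segments line up for a continuous plot:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.random.rand(40, 1440)  # 40 cycles x 1440 pixels (synthetic stand-in)

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    # each test window starts exactly where the training window ends
    print("train %d-%d, test %d-%d" % (train_index[0], train_index[-1],
                                       test_index[0], test_index[-1]))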