cross-validation

Time Series Cross-Validation in R: Using tsCV() with tslm() Models

六眼飞鱼酱① submitted on 2019-12-08 07:16:48
Question: I am currently trying to evaluate a tslm model using time series cross-validation. I want to use a fixed model (without parameter re-estimation) and look at the 1- to 3-step-ahead forecasts over the evaluation period of the last year. I am having trouble getting tsCV() and tslm() from the forecast library to work well together. What am I missing?

    library(forecast)
    library(ggfortify)

    AirPassengers_train <- head(AirPassengers, 100)
    AirPassengers_test <- tail(AirPassengers, 44)

    ## Holdout Evaluation
    n_train <- length(AirPassengers_train)
    n_test <- length(AirPassengers_test)
    pred_train <- ts(rnorm(n…
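
Rolling-origin evaluation is the idea behind tsCV(): the forecast origin advances one observation at a time and errors are collected per horizon. As a language-neutral illustration, here is a minimal Python sketch with synthetic data and a naive stand-in model (not the forecast package's API):

    import numpy as np

    # Synthetic series; a naive last-value forecast stands in for a fixed model.
    y = np.sin(np.arange(120) / 6.0) + np.random.default_rng(0).normal(0, 0.1, 120)
    horizons = (1, 2, 3)
    errors = {h: [] for h in horizons}

    # Roll the forecast origin forward one observation at a time.
    for origin in range(100, len(y) - max(horizons) + 1):
        forecast = y[:origin][-1]                    # no re-estimation at each origin
        for h in horizons:
            errors[h].append(y[origin + h - 1] - forecast)

    for h in horizons:
        print(f"h={h} RMSE: {np.sqrt(np.mean(np.square(errors[h]))):.3f}")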

Scikit-learn GridSearchCV AUC performance

巧了我就是萌 submitted on 2019-12-08 05:15:56
Question: I'm using GridSearchCV to identify the best set of parameters for a random forest classifier.

    PARAMS = {
        'max_depth': [8, None],
        'n_estimators': [500, 1000]
    }
    rf = RandomForestClassifier()
    clf = grid_search.GridSearchCV(estimator=rf, param_grid=PARAMS,
                                   scoring='roc_auc', cv=5, n_jobs=4)
    clf.fit(data, labels)

where data and labels are, respectively, the full dataset and the corresponding labels. Now, I compared the performance returned by GridSearchCV (from clf.grid_scores_) with a "manual"…
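
For context, sklearn.grid_search and grid_scores_ belong to older scikit-learn releases. Under the current sklearn.model_selection API, a hedged sketch of the same grid-search-vs-manual comparison, with synthetic data standing in for the question's data and labels:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score

    data, labels = make_classification(n_samples=500, random_state=0)
    PARAMS = {'max_depth': [8, None], 'n_estimators': [100, 200]}

    clf = GridSearchCV(RandomForestClassifier(random_state=0), PARAMS,
                       scoring='roc_auc', cv=5, n_jobs=-1)
    clf.fit(data, labels)
    print("grid search best AUC:", clf.best_score_)

    # "Manual" check: same CV scheme and scoring on the best parameter set.
    manual = cross_val_score(
        RandomForestClassifier(random_state=0, **clf.best_params_),
        data, labels, scoring='roc_auc', cv=5)
    print("manual mean AUC:", manual.mean())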

NaNs suddenly appearing for sklearn KFolds

折月煮酒 submitted on 2019-12-08 04:39:16
Question: I'm trying to run cross-validation on my data set. The data appears to be clean, but when I try to run it, some of my data gets replaced by NaNs. I'm not sure why. Has anybody seen this before?

    y, X = np.ravel(df_test['labels']), df_test[['variation', 'length', 'tempo']]
    X_train, X_test, y_train, y_test = cv.train_test_split(X, y, test_size=.30, random_state=4444)

This is what my X data looked like before KFolds:

       variation       length       tempo
    0   0.005144  1183.148118  135.999178
    1   0.002595   720…
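
The excerpt cuts off before the KFold loop, but one common source of surprise NaNs is pandas index alignment when positional fold indices are applied to a DataFrame. A hedged sketch of the usual check and workaround, with a hypothetical stand-in for df_test:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold

    # Hypothetical stand-in for the question's df_test.
    df_test = pd.DataFrame({'variation': [0.005144, 0.002595, 0.003],
                            'length': [1183.148118, 720.0, 900.0],
                            'tempo': [135.999178, 120.0, 128.0],
                            'labels': [0, 1, 0]})

    X = df_test[['variation', 'length', 'tempo']].to_numpy()  # arrays sidestep index alignment
    y = df_test['labels'].to_numpy()
    print(np.isnan(X).sum(), "NaNs before splitting")         # verify cleanliness first

    for train_idx, test_idx in KFold(n_splits=3).split(X):
        X_train, X_test = X[train_idx], X[test_idx]           # positional indexing is safe here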

Cross Validation metrics with Pyspark

巧了我就是萌 submitted on 2019-12-08 02:34:27
Question: When we do a k-fold cross-validation, we are testing how well a model behaves when it comes to predicting data it has never seen. If I split my dataset into 90% training and 10% test and analyse the model performance, there is no guarantee that my test set doesn't contain only the 10% "easiest" or "hardest" points to predict. By doing a 10-fold cross-validation, I can be assured that every point will be used at least once for training. As (in this case) the model will be tested 10 times, we can do an…
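
In pyspark.ml, the fold-averaged metric for each parameter combination is exposed as avgMetrics on the fitted CrossValidatorModel. A minimal sketch with toy data (column names follow the pyspark defaults):

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    spark = SparkSession.builder.getOrCreate()
    train = spark.createDataFrame(
        [(Vectors.dense([float(i)]), float(i % 2)) for i in range(50)],
        ["features", "label"])

    lr = LogisticRegression()
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),  # areaUnderROC by default
                        numFolds=10, seed=7)
    print(cv.fit(train).avgMetrics)  # one fold-averaged metric per parameter combination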

Alternate different models in Pipeline for GridSearchCV

懵懂的女人 submitted on 2019-12-07 09:28:29
Question: I want to build a Pipeline in sklearn and test different models using GridSearchCV. Just an example (please do not pay attention to which particular models are chosen):

    reg = LogisticRegression()
    proj1 = PCA(n_components=2)
    proj2 = MDS()
    proj3 = TSNE()
    pipe = [('proj', proj1), ('reg', reg)]
    pipe = Pipeline(pipe)
    param_grid = {
        'reg__C': [0.01, 0.1, 1],
    }
    clf = GridSearchCV(pipe, param_grid=param_grid)

Here, if I want to try different models for dimensionality reduction, I need to code…
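
One way to avoid hand-coding each combination: scikit-learn treats a pipeline step itself as a settable parameter, so candidate models can be listed directly in param_grid. A hedged sketch (PCA and TruncatedSVD are used because MDS and TSNE lack a transform() method and cannot sit mid-pipeline):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA, TruncatedSVD
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    X, y = make_classification(n_samples=200, random_state=0)
    pipe = Pipeline([('proj', PCA()), ('reg', LogisticRegression())])

    # Each dict is one branch of the search; the 'proj' step itself is searched over.
    param_grid = [
        {'proj': [PCA(n_components=2), PCA(n_components=5)], 'reg__C': [0.01, 0.1, 1]},
        {'proj': [TruncatedSVD(n_components=2)], 'reg__C': [0.01, 0.1, 1]},
    ]
    clf = GridSearchCV(pipe, param_grid=param_grid)
    clf.fit(X, y)
    print(clf.best_params_)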

H2O - balance classes - cross validation

醉酒当歌 submitted on 2019-12-06 13:47:25
I would like to build a GBM model with H2O. My data set is imbalanced, so I am using the balance_classes parameter. For grid search (parameter tuning) I would like to use 5-fold cross-validation. I am wondering how H2O deals with class balancing in that case. Will only the training folds be rebalanced? I want to be sure the test fold is not rebalanced. Thank you.

In class-imbalance settings, artificially balancing the test/validation set does not make any sense: these sets must remain realistic, i.e., you want to test your classifier's performance in the real-world setting, where, say, the…
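
For reference, a minimal sketch of the knobs in question in the H2O Python API, on a hypothetical toy frame; balance_classes and nfolds are the relevant H2OGradientBoostingEstimator parameters:

    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    h2o.init()
    # Hypothetical toy frame; in practice this would come from h2o.import_file(...).
    df = h2o.H2OFrame({"x1": list(range(20)), "label": [0, 1] * 10})
    df["label"] = df["label"].asfactor()            # mark the target as categorical

    gbm = H2OGradientBoostingEstimator(
        balance_classes=True,   # oversample minority classes in the training data
        nfolds=5,               # 5-fold cross-validation
        seed=42)
    gbm.train(x=["x1"], y="label", training_frame=df)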

How to split an image datastore for cross-validation in MATLAB?

江枫思渺然 submitted on 2019-12-06 13:39:47
In MATLAB, the splitEachLabel method of an imageDatastore object splits an image data store into proportions per category label. How can one split an image data store for training using cross-validation together with the trainImageCategoryClassifier class? That is, it's easy to split it into N partitions, but then some sort of mergeEachLabel functionality is needed to be able to train a classifier using cross-validation. Or is there another way of achieving that? Regards, Elena

I stumbled on the same issue recently. Not sure if there is anyone still looking for a possible solution to this. I ended up…
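
Outside MATLAB, the same pattern (stratified fold assignment over labeled image files, with the training folds merged on each iteration) can be sketched with scikit-learn; file names here are hypothetical:

    from sklearn.model_selection import StratifiedKFold

    # Hypothetical image paths and their category labels.
    files = ['cat1.jpg', 'cat2.jpg', 'cat3.jpg', 'dog1.jpg', 'dog2.jpg', 'dog3.jpg']
    labels = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog']

    for train_idx, test_idx in StratifiedKFold(n_splits=3).split(files, labels):
        train_files = [files[i] for i in train_idx]   # N-1 folds merged for training
        test_files = [files[i] for i in test_idx]     # held-out fold for evaluation
        print(train_files, '|', test_files)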

How to use CrossValidator to choose between different models

怎甘沉沦 submitted on 2019-12-06 12:01:05
I know that I can use a CrossValidator to tune a single model. But what is the suggested approach for evaluating different models against each other? For example, say I wanted to evaluate a LogisticRegression classifier against a LinearSVC classifier using CrossValidator.

After familiarizing myself a bit with the API, I solved this problem by implementing a custom Estimator that wraps two or more estimators it can delegate to, where the selected estimator is controlled by a single Param[Int]. Here is the actual code:

    import org.apache.spark.ml.Estimator
    import org.apache.spark.ml.Model…
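
The answer's custom-Estimator wrapper is Scala; a simpler, if less flexible, alternative is to cross-validate each candidate separately with the same evaluator and fold count and compare the averaged metrics. A hedged pyspark sketch with toy data:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LinearSVC, LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    spark = SparkSession.builder.getOrCreate()
    data = spark.createDataFrame(
        [(Vectors.dense([float(i)]), float(i % 2)) for i in range(40)],
        ["features", "label"])

    evaluator = BinaryClassificationEvaluator()
    results = {}
    for name, est in [("lr", LogisticRegression()), ("svc", LinearSVC())]:
        cv = CrossValidator(estimator=est,
                            estimatorParamMaps=ParamGridBuilder().build(),  # no grid, CV only
                            evaluator=evaluator, numFolds=3, seed=1)
        results[name] = max(cv.fit(data).avgMetrics)
    print(results)  # pick the model with the best cross-validated metric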

Does sklearn LogisticRegressionCV use all data for final model

孤街醉人 submitted on 2019-12-06 10:53:49
Question: I was wondering how the final model (i.e., decision boundary) of LogisticRegressionCV in sklearn is calculated. Say I have some Xdata and ylabels such that

    Xdata    # shape of this is (n_samples, n_features)
    ylabels  # shape of this is (n_samples,), and it is binary

and now I run

    from sklearn.linear_model import LogisticRegressionCV
    clf = LogisticRegressionCV(Cs=[1.0], cv=5)
    clf.fit(Xdata, ylabels)

This is looking at just one regularization parameter and 5 folds in the CV. So clf.scores_ will be…
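
For inspection: with the default refit=True, LogisticRegressionCV refits one final model on all of the data at the best C after cross-validation, and scores_ holds the per-fold results. A hedged sketch on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegressionCV

    Xdata, ylabels = make_classification(n_samples=300, random_state=0)
    clf = LogisticRegressionCV(Cs=[1.0], cv=5, refit=True)  # refit=True is the default
    clf.fit(Xdata, ylabels)

    print(clf.scores_[1].shape)  # (n_folds, n_Cs) = (5, 1): score per fold per C
    print(clf.C_)                # chosen C; coef_ comes from the refit on all data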