cross-validation

How to extract best parameters from a CrossValidatorModel

Submitted by ↘锁芯ラ on 2019-12-17 21:54:07
Question: I want to find the ParamGridBuilder parameters that produce the best model in CrossValidator in Spark 1.4.x. In the Pipeline example in the Spark documentation, different parameters (numFeatures, regParam) are added through a ParamGridBuilder in the Pipeline. Then the best model is built with the following line of code:

val cvModel = crossval.fit(training.toDF)

Now I want to know which parameters (numFeatures, regParam) from the ParamGridBuilder produced that best model. I already used the…
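For reference, a minimal sketch of one common approach, written against the PySpark API of recent Spark versions rather than the asker's Scala 1.4.x code: each entry of avgMetrics lines up with the corresponding ParamMap from getEstimatorParamMaps, so zipping the two recovers the winning parameter combination.

```python
# Sketch: assumes cvModel is a fitted pyspark.ml.tuning.CrossValidatorModel
# and that the evaluator's metric is "larger is better".
best_params, best_metric = max(
    zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics),
    key=lambda pair: pair[1],
)
for param, value in best_params.items():
    print(param.name, "=", value)
print("average metric:", best_metric)
```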

Cross-validation metrics in scikit-learn for each data split

Submitted by 风格不统一 on 2019-12-17 21:15:37
Question: I just need to get the cross-validation statistics explicitly for each split of the (X_test, y_test) data. To try to do so, I did:

kf = KFold(n_splits=n_splits)
X_train_tmp = []
y_train_tmp = []
X_test_tmp = []
y_test_tmp = []
mae_train_cv_list = []
mae_test_cv_list = []
for train_index, test_index in kf.split(X_train):
    for i in range(len(train_index)):
        X_train_tmp.append(X_train[train_index[i]])
        y_train_tmp.append(y_train[train_index[i]])
    for i in range(len(test_index)):
        X_test…
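The element-by-element copying above can be replaced with NumPy indexing, which also makes it easy to record one metric per fold. A minimal sketch, assuming X_train and y_train are NumPy arrays and using an illustrative estimator (the asker's model is not shown):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

kf = KFold(n_splits=5)
mae_train_cv_list, mae_test_cv_list = [], []

for train_index, test_index in kf.split(X_train):
    # Index the arrays directly instead of copying element by element.
    X_tr, X_te = X_train[train_index], X_train[test_index]
    y_tr, y_te = y_train[train_index], y_train[test_index]

    model = LinearRegression().fit(X_tr, y_tr)
    mae_train_cv_list.append(mean_absolute_error(y_tr, model.predict(X_tr)))
    mae_test_cv_list.append(mean_absolute_error(y_te, model.predict(X_te)))
```

Each list then holds one MAE value per fold, which is exactly the per-split statistic asked for.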

ValueError: n_splits=10 cannot be greater than the number of members in each class

Submitted by 时光毁灭记忆、已成空白 on 2019-12-17 19:25:51
Question: I am trying to run the following code:

from sklearn.model_selection import StratifiedKFold
X = ["hey", "join now", "hello", "join today", "join us now", "not today", "join this trial", " hey hey", " no", "hola", "bye", "join today", "no", "join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "r", "n", "n", "n", "r"]
skf = StratifiedKFold(n_splits=10)
for train, test in skf.split(X, y):
    print("%s %s" % (train, test))

But I get the following error:

ValueError: n_splits=10 cannot be…
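StratifiedKFold needs every class to appear at least n_splits times, and here the minority class "r" has only 5 samples, so at most 5 stratified folds are possible. A minimal sketch of one way around it, capping n_splits at the size of the smallest class:

```python
from collections import Counter
from sklearn.model_selection import StratifiedKFold

X = ["hey", "join now", "hello", "join today", "join us now", "not today",
     "join this trial", " hey hey", " no", "hola", "bye", "join today",
     "no", "join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "r", "n", "n", "n", "r"]

# Cap the number of folds at the size of the rarest class (5 for "r" here).
n_splits = min(Counter(y).values())
skf = StratifiedKFold(n_splits=n_splits)
for train, test in skf.split(X, y):
    print("%s %s" % (train, test))
```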

Cross-validation in LightGBM

Submitted by 佐手、 on 2019-12-17 19:18:50
Question: After reading through LightGBM's documentation on cross-validation, I'm hoping this community can shed light on cross-validating results and improving our predictions using LightGBM. How are we supposed to use the dictionary output from lightgbm.cv to improve our predictions? Here's an example: we train our CV model using the code below:

cv_mod = lgb.cv(params, d_train, 500, nfold = 10, early_stopping_rounds = 25, stratified = True)

How can we use the parameters found from the best iteration…
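One common pattern, sketched below rather than stated as the definitive answer: when early stopping is used, the per-round metric lists returned by lgb.cv stop at the best iteration, so their length tells you how many boosting rounds to use when retraining a single model on the full training set. The dictionary key ('auc-mean' in the sketch) depends on the metric in params and on the LightGBM version; params and d_train are the asker's objects.

```python
import lightgbm as lgb

cv_mod = lgb.cv(params, d_train, 500, nfold=10,
                early_stopping_rounds=25, stratified=True)

# With early stopping, the per-round metric lists stop at the best iteration.
best_num_rounds = len(cv_mod['auc-mean'])      # key name depends on metric/version
print("best mean CV score:", cv_mod['auc-mean'][-1])

# Retrain once on all of d_train with that number of rounds.
final_model = lgb.train(params, d_train, num_boost_round=best_num_rounds)
```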

How to cross validate RandomForest model?

Submitted by 倖福魔咒の on 2019-12-17 18:34:50
Question: I want to evaluate a random forest trained on some data. Is there any utility in Apache Spark to do this, or do I have to perform cross-validation manually?

Answer 1: ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml…
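The Scala snippet above is cut off; as a rough illustration of the same idea, here is a PySpark sketch. The column names "features" and "label" and the grid values are assumptions, not part of the original answer:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[rf])

grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [20, 50])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=5)

cvModel = cv.fit(train_df)   # train_df: your preprocessed DataFrame (assumption)
```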

How does Caret generate an OLS model with K-fold cross validation?

Submitted by 本秂侑毒 on 2019-12-14 03:59:05
Question: Let's say I have some generic dataset for which an OLS regression is the best choice. So I generate a model with some first-order terms and decide to use caret in R for my regression coefficient estimates and error estimates. In caret, this ends up being:

k10_cv = trainControl(method="cv", number=10)
ols_model = train(Y ~ X1 + X2 + X3, data = my_data, trControl = k10_cv, method = "lm")

From there, I can pull out regression information using summary(ols_model) and can also pull some more…
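Conceptually, the ten folds are used only to estimate out-of-sample error; the reported coefficients come from a final fit on all of the data. A rough scikit-learn analogue of that workflow, purely for illustration (X and y stand in for the predictors X1–X3 and the response Y; this is not the asker's R code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 10-fold CV supplies the error estimate...
cv_scores = cross_val_score(LinearRegression(), X, y,
                            cv=10, scoring="neg_mean_squared_error")
print("CV RMSE estimate:", np.sqrt(-cv_scores.mean()))

# ...while the final coefficients come from a fit on the full data set.
final_model = LinearRegression().fit(X, y)
print("coefficients:", final_model.coef_)
```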

Get standard deviation for a GridSearchCV

Submitted by 我是研究僧i on 2019-12-14 03:49:13
Question: Before scikit-learn 0.20 we could use result.grid_scores_[result.best_index_] to get the standard deviation. (It returned, for example: mean: 0.76172, std: 0.05225, params: {'n_neighbors': 21}.) What is the best way in scikit-learn 0.20 to get the standard deviation of the best score?

Answer 1: In newer versions, grid_scores_ has been renamed to cv_results_. Following the documentation, you need this:

best_index_ : int
The index (of the cv_results_ arrays) which corresponds to the best candidate…
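Putting the answer together, a short sketch (assuming result is a fitted GridSearchCV, as in the question):

```python
# Mean and standard deviation of the best candidate's test score across the
# CV folds, in scikit-learn >= 0.20.
best = result.best_index_
mean_score = result.cv_results_['mean_test_score'][best]
std_score = result.cv_results_['std_test_score'][best]
print("mean: %.5f, std: %.5f, params: %s"
      % (mean_score, std_score, result.best_params_))
```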

Test set and train set for each fold in Caret cross validation

Submitted by 拟墨画扇 on 2019-12-14 02:02:20
Question: I tried to understand the 5-fold cross-validation algorithm in the caret package, but I could not find out how to get the train set and test set for each fold, and I also could not find this in the similar suggested questions. Imagine I want to do cross-validation with the random forest method; I do the following:

set.seed(12)
train_control <- trainControl(method="cv", number=5, savePredictions = TRUE)
rfmodel <- train(Species~., data=iris, trControl=train_control, method="rf")
first_holdout <- subset…

Error: __init__() got an unexpected keyword argument 'n_splits'

Submitted by 馋奶兔 on 2019-12-13 11:01:46
Question: I am trying to apply the ShuffleSplit() method to the California housing dataset (source: https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) in order to fit an SGD regression. However, an 'n_splits' error occurs when the method is applied. The code is as follows:

from sklearn import cross_validation, grid_search, linear_model, metrics
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale
from sklearn.cross_validation import ShuffleSplit
housing_data = pd.read_csv('cal…
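The error comes from importing ShuffleSplit from the deprecated sklearn.cross_validation module, whose constructor takes the number of samples as its first argument and does not accept n_splits; that keyword belongs to the sklearn.model_selection version. A minimal sketch of the fix:

```python
# Import from model_selection instead of the deprecated cross_validation module.
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
# cv can then be passed to GridSearchCV(..., cv=cv) or cross_val_score(..., cv=cv).
```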

The tuning parameter in “glm” vs “rf”

Submitted by 给你一囗甜甜゛ on 2019-12-13 07:43:25
Question: I am trying to build a classification model using method = "glm" in train. When I use method = "rpart" it works fine, but when I switch to method = "glm" it gives me an error saying: The tuning parameter grid should have columns parameter. I tried cpGrid = data.frame(.0001) and also cpGrid = data.frame(expand.grid(.cp = seq(.0001, .09, .001))), but both throw an error. Below is my initial code:

numFolds = trainControl(method = "cv", number = 10, repeats = 3)
cpGrid = expand.grid(.cp =…