cross-validation

Order between using validation, training and test sets

六眼飞鱼酱① Submitted on 2019-11-27 09:38:26
I am trying to understand the process of model evaluation and validation in machine learning, specifically the order in which the training, validation and test sets must be used. Say I have a dataset and want to use linear regression, and I am hesitating among various polynomial degrees (hyper-parameters). This Wikipedia article seems to imply that the sequence should be: (1) split the data into a training set, a validation set and a test set; (2) use the training set to fit the model (find the best parameters, i.e. the coefficients of the polynomial); (3) afterwards, use the validation set to find the best
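A minimal sketch of that sequence (assuming scikit-learn, synthetic data, and a made-up set of candidate degrees; the excerpt is cut off, so the final test-set step is inferred from the standard workflow):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Hypothetical data: a single feature and a noisy target.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# 1) Split once into train / validation / test (60 / 20 / 20 here).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# 2) Fit one model per candidate degree on the training set only,
# 3) and pick the degree that performs best on the validation set.
best_degree, best_val_mse = None, np.inf
for degree in (1, 2, 3, 5, 9):                      # candidate hyper-parameters
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_degree, best_val_mse = degree, val_mse

# 4) Refit the chosen model and report its error once on the held-out test set.
final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
final_model.fit(np.vstack([X_train, X_val]), np.hstack([y_train, y_val]))
print(best_degree, mean_squared_error(y_test, final_model.predict(X_test)))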

scikit-learn cross validation, negative values with mean squared error

半腔热情 Submitted on 2019-11-27 06:58:19
When I use the following code with a data matrix X of size (952, 144) and an output vector y of size (952,), the mean_squared_error metric returns negative values, which is unexpected. Do you have any idea why?
from sklearn.svm import SVR
from sklearn import cross_validation as CV
reg = SVR(C=1., epsilon=0.1, kernel='rbf')
scores = CV.cross_val_score(reg, X, y, cv=10, scoring='mean_squared_error')
All values in scores are then negative.
AN6U5: Trying to close this out, so I am providing the answer that David and larsmans have eloquently described in the comments section: yes, this is supposed to happen. The
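A small sketch of the convention behind that answer, using the modern sklearn.model_selection API and the 'neg_mean_squared_error' scorer name (synthetic data stands in for the question's X and y):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)
reg = SVR(C=1., epsilon=0.1, kernel='rbf')

# cross_val_score always maximizes its scoring function, so error metrics are
# returned negated ("neg_..."): a larger (less negative) value is better.
scores = cross_val_score(reg, X, y, cv=10, scoring='neg_mean_squared_error')

mse_per_fold = -scores            # flip the sign to recover ordinary MSE values
print(mse_per_fold.mean())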

Model help using Scikit-learn when using GridSearch

喜欢而已 Submitted on 2019-11-27 04:28:51
Question: As part of the Enron project I built the attached model; below is a summary of the steps. The model below gives nearly perfect scores.
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # ---> with the full dataset
for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
gcv.best_estimator_
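A hedged sketch of the usual fix for the leakage this describes: hold out a test split that GridSearchCV never sees, and let the CV splitter work only inside the training portion. The pipeline and parameter grid below are placeholders for the question's pipe and clf_params.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

features, labels = make_classification(n_samples=300, random_state=42)
pipe = Pipeline([('scale', MinMaxScaler()), ('clf', DecisionTreeClassifier(random_state=42))])
clf_params = {'clf__max_depth': [2, 4, 6]}

# 1) Hold out a test set that the grid search never touches.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

# 2) Cross-validate only inside the training data.
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(X_train, y_train)

# 3) One honest evaluation of the winning model on unseen data.
print(gcv.best_params_)
print(gcv.best_estimator_.score(X_test, y_test))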

Split tensor into training and test sets

给你一囗甜甜゛ Submitted on 2019-11-27 02:42:30
Question: Let's say I've read in a text file using a TextLineReader. Is there some way to split this into train and test sets in TensorFlow? Something like:
def read_my_file_format(filename_queue):
    reader = tf.TextLineReader()
    key, record_string = reader.read(filename_queue)
    raw_features, label = tf.decode_csv(record_string)
    features = some_processing(raw_features)
    features_train, labels_train, features_test, labels_test = tf.train_split(features, labels, frac=.1)
    return features_train, labels_train,
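For what it's worth, tf.train_split in the snippet is the asker's wish rather than an existing TensorFlow op; one common workaround (sketched here with scikit-learn's train_test_split on synthetic in-memory data, assuming the dataset fits in memory) is to split before the data ever enters the graph:

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the CSV the TextLineReader was reading:
# column 0 is the label, the remaining columns are raw features.
data = np.random.rand(1000, 10)
labels, features = (data[:, 0] > 0.5), data[:, 1:]

# Hold out 10% for testing, mirroring frac=.1 in the question.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.1, random_state=0)

# Each split can then be fed to TensorFlow separately, e.g. via placeholders
# and feed_dict, or by building one input pipeline per split.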

predict_proba for a cross-validated model

依然范特西╮ Submitted on 2019-11-27 02:13:19
Question: I would like to predict probabilities from a Logistic Regression model with cross-validation. I know you can get the cross-validation scores, but is it possible to return the values from predict_proba instead of the scores?
# imports
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import (StratifiedKFold, cross_val_score, train_test_split)
from sklearn import datasets
# setup data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# setup model
cv =
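One way to get exactly that, sketched with the modern sklearn.model_selection API: cross_val_predict accepts method='predict_proba' and returns the out-of-fold probability estimates for every sample.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10)

# Each row holds the class probabilities predicted for that sample by the
# model trained on the folds that did not contain it.
proba = cross_val_predict(clf, X, y, cv=cv, method='predict_proba')
print(proba.shape)   # (150, 3)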

(Python - sklearn) How to pass parameters to the customize ModelTransformer class by gridsearchcv

橙三吉。 Submitted on 2019-11-27 00:20:11
Question: Below is my pipeline, and it seems that I can't pass parameters to my models through the ModelTransformer class, which I took from this post (http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html). The error message makes sense to me, but I don't know how to fix it. Any idea how to fix this? Thanks.
# define a pipeline
pipeline = Pipeline([
    ('vect', DictVectorizer(sparse=False)),
    ('scale', preprocessing.MinMaxScaler()),
    ('ess', FeatureUnion(n_jobs=-1,
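The usual resolution (a hedged sketch, since the full pipeline and error message are cut off above) is to make the wrapper a proper scikit-learn estimator, so that GridSearchCV can reach the inner model through double-underscore parameter names:

from sklearn.base import BaseEstimator, TransformerMixin

class ModelTransformer(BaseEstimator, TransformerMixin):
    """Wraps an estimator so its predictions can be used as features."""

    def __init__(self, model):
        self.model = model          # keep the constructor argument name unchanged

    def fit(self, X, y=None):
        self.model.fit(X, y)
        return self

    def transform(self, X):
        return self.model.predict(X).reshape(-1, 1)

# Because ModelTransformer inherits get_params/set_params from BaseEstimator,
# GridSearchCV can address the wrapped model with nested names, e.g. if a step
# is ('ess', FeatureUnion([('dt', ModelTransformer(DecisionTreeRegressor()))])):
#   param_grid = {'ess__dt__model__max_depth': [3, 5, 7]}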

Topic models: cross validation with loglikelihood or perplexity

*爱你&永不变心* Submitted on 2019-11-26 23:55:11
Question: I'm clustering documents using topic modeling and need to come up with the optimal number of topics, so I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics. I have divided my corpus into ten batches and set aside one batch as a holdout set. I have run latent Dirichlet allocation (LDA) on nine batches (180 documents in total) with 10 to 60 topics. Now, I have to
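A sketch of the same selection loop in Python (the question itself is implementation-agnostic; scikit-learn's LatentDirichletAllocation and its perplexity method stand in here for whatever LDA library is used, and the document-term matrix is synthetic):

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import KFold

# Hypothetical document-term matrix standing in for the 180-document corpus.
rng = np.random.RandomState(0)
X = rng.poisson(0.5, size=(180, 1000))

results = {}
for n_topics in (10, 20, 30, 40, 50, 60):
    fold_perplexities = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        lda.fit(X[train_idx])
        # Lower held-out perplexity means better generalization to unseen documents.
        fold_perplexities.append(lda.perplexity(X[test_idx]))
    results[n_topics] = np.mean(fold_perplexities)

best_n_topics = min(results, key=results.get)
print(best_n_topics, results[best_n_topics])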

Does GridSearchCV perform cross-validation?

本秂侑毒 Submitted on 2019-11-26 20:14:22
Question: I'm currently working on a problem that compares the performance of three different machine learning algorithms on the same data set. I divided the data set into 70/30 training/testing sets and then performed a grid search for the best parameters of each algorithm using GridSearchCV with X_train and y_train. First question: am I supposed to perform the grid search on the training set, or on the whole data set? Second question: I know that GridSearchCV uses K-fold in its implementation,
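Both questions map onto a short sketch (assuming an SVC with a made-up parameter grid as one of the three algorithms): GridSearchCV runs its own K-fold cross-validation internally on whatever data it is given, so it should only ever see the 70% training split, and the 30% test split is kept for the final comparison.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']}, cv=5)
grid.fit(X_train, y_train)               # 5-fold CV happens inside the training data only

print(grid.best_params_)
print(grid.score(X_test, y_test))        # evaluated once on the held-out 30%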

Spark CrossValidatorModel access other models than the bestModel?

你说的曾经没有我的故事 Submitted on 2019-11-26 17:24:24
Question: I am using Spark 1.6.1. Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the model that performed best during the cross-validation. Are the other models of the cross-validation automatically discarded, or can I select a model that performed worse than the bestModel? I am asking because I am using the F1 score metric for the cross-validation, but I am also
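For reference, a hedged pyspark sketch of what can be recovered: the CrossValidatorModel keeps an averaged metric per parameter combination rather than the fitted sub-models themselves. Here crossval and cvModel are assumed to be the CrossValidator and its fitted result from the question, trainingData the training DataFrame, and whether avgMetrics is exposed in the 1.6 Python binding is itself an assumption.

# One averaged metric (here F1) per parameter combination, in grid order, so
# settings that lost to bestModel can still be inspected.
param_maps = crossval.getEstimatorParamMaps()
for params, metric in zip(param_maps, cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, metric)

# The per-fold models are discarded; to obtain a non-best model, refit the
# estimator yourself with the chosen parameter map.
ranked = sorted(zip(cvModel.avgMetrics, param_maps), key=lambda t: t[0], reverse=True)
runner_up_model = crossval.getEstimator().fit(trainingData, ranked[1][1])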
