cross-validation | 易学教程

sklearn: User defined cross validation for time series data

Question: I'm trying to solve a machine learning problem. I have a dataset with a time-series element. For this problem I'm using the well-known Python library sklearn. The library has many cross-validation iterators, as well as several iterators for defining cross-validation yourself. The problem is that I don't really know how to define a simple cross-validation for time series. Here is a good example of what I'm trying to get: Suppose we have several periods (years) and we want
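One way to define such a scheme yourself is to generate the (train, test) index pairs directly. The sketch below assumes an expanding-window layout (train on all earlier years, test on the next one), which is only a guess at what the truncated question goes on to describe. `expanding_year_splits` and the `years` list are hypothetical names introduced for illustration.

```python
def expanding_year_splits(years, min_train_years=1):
    """Yield (train_idx, test_idx) pairs: train on all earlier years, test on the next one.

    `years` holds one period label per sample (assumed layout, not from the question).
    """
    unique_years = sorted(set(years))
    for k in range(min_train_years, len(unique_years)):
        test_year = unique_years[k]
        train_years = set(unique_years[:k])
        train_idx = [i for i, y in enumerate(years) if y in train_years]
        test_idx = [i for i, y in enumerate(years) if y == test_year]
        yield train_idx, test_idx


# Hypothetical data: two samples per year.
years = [2010, 2010, 2011, 2011, 2012, 2012]
splits = list(expanding_year_splits(years))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

scikit-learn's `cv` parameter accepts an iterable of (train, test) index pairs, so a list like `splits` could be passed to utilities such as `cross_val_score`. `sklearn.model_selection.TimeSeriesSplit` is a built-in alternative that splits by sample count rather than by calendar period.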

return coefficients from Pipeline object in sklearn

Question: I've fit a Pipeline object with RandomizedSearchCV:

    pipe_sgd = Pipeline([('scl', StandardScaler()), ('clf', SGDClassifier(n_jobs=-1))])
    param_dist_sgd = {'clf__loss': ['log'], 'clf__penalty': [None, 'l1', 'l2', 'elasticnet'], 'clf__alpha': np.linspace(0.15, 0.35), 'clf__n_iter': [3, 5, 7]}
    sgd_randomized_pipe = RandomizedSearchCV(estimator=pipe_sgd, param_distributions=param_dist_sgd, cv=3, n_iter=30, n_jobs=-1)
    sgd_randomized_pipe.fit(X_train, y_train)

I want to access the coef_ attribute
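A minimal sketch of one way to reach the fitted classifier inside a searched pipeline, assuming a recent scikit-learn. The synthetic data from `make_classification`, the reduced parameter grid, and the variable name `search` are assumptions made for illustration. The question's `loss` and `n_iter` settings are left out because their accepted values have changed across scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data, for illustration only.
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

pipe_sgd = Pipeline([("scl", StandardScaler()),
                     ("clf", SGDClassifier(random_state=0))])
param_dist_sgd = {"clf__penalty": ["l1", "l2", "elasticnet"],
                  "clf__alpha": np.linspace(0.15, 0.35)}

search = RandomizedSearchCV(pipe_sgd, param_distributions=param_dist_sgd,
                            cv=3, n_iter=5, random_state=0)
search.fit(X_train, y_train)

# best_estimator_ is the whole pipeline, refit on X_train with the best parameters.
# named_steps looks up a pipeline step by the name it was given ("clf" here).
best_clf = search.best_estimator_.named_steps["clf"]
coef = best_clf.coef_
print(coef.shape)
```

The lookup relies on `refit=True`, the default for `RandomizedSearchCV`; with `refit=False` there is no `best_estimator_` to inspect.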

Does sklearn LogisticRegressionCV use all data for final model

I was wondering how the final model (i.e. decision boundary) of LogisticRegressionCV in sklearn is calculated. Say I have some Xdata and ylabels such that

    Xdata    # shape of this is (n_samples, n_features)
    ylabels  # shape of this is (n_samples,), and it is binary

and now I run

    from sklearn.linear_model import LogisticRegressionCV
    clf = LogisticRegressionCV(Cs=[1.0], cv=5)
    clf.fit(Xdata, ylabels)

This looks at just one regularization parameter and 5 folds in the CV. So clf.scores_ will be a dictionary with one key whose value is an array with shape (n_folds, 1). With these five folds

Why does calling the KFold generator with shuffle give the same indices?

With sklearn, when you create a new KFold object with shuffle set to true, it produces different, newly randomized fold indices. However, every generator from a given KFold object gives the same indices for each fold, even when shuffle is true. Why does it work like this? Example:

    import numpy as np
    from sklearn.cross_validation import KFold
    X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    y = np.array([1, 2, 3, 4])
    kf = KFold(4, n_folds=2, shuffle=True)
    for fold in kf:
        print fold
    print '---second round----'
    for fold in kf:
        print fold

Output:

    (array([2, 3]), array([0, 1]))
    (array([0, 1]), array([2, 3]))
    --
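The `sklearn.cross_validation` module used in this excerpt has since been removed from scikit-learn. As a sketch of the same experiment with the current `sklearn.model_selection` API: the splitter is created without the data, and the indices come from `split(X)`. Fixing `random_state` to an integer (an assumption added here, not part of the question) makes repeated calls return identical folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

# An integer random_state reseeds the shuffle on every call to split().
kf = KFold(n_splits=2, shuffle=True, random_state=0)

first_round = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]
second_round = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]

same = first_round == second_round
print(same)
```

The scikit-learn documentation describes `random_state=None` as producing different splits across calls, so leaving the seed unset is the way to get fresh folds each round in the current API.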

Scikit-learn, GroupKFold with shuffling groups?

Question: I was using StratifiedKFold from scikit-learn, but now I also need to account for "groups". There is a nice function, GroupKFold, but my data are very time dependent. So, similarly to the example in the help, the week number is the grouping index, and each week should be in only one fold. Suppose I need 10 folds. What I need is to shuffle the data first, before I can use GroupKFold. The shuffling is in the group sense, so whole groups should be shuffled among each other. Is there a way to do this with scikit-learn elegant
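One possible approach, sketched under assumptions: run a shuffled `KFold` over the unique group labels and map each fold of groups back to sample indices. This swaps in `KFold` on the group labels rather than using `GroupKFold` itself, and it does not stratify by class. `shuffled_group_kfold` and `weeks` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import KFold


def shuffled_group_kfold(groups, n_splits, seed=0):
    """Yield (train_idx, test_idx) so that each group lands in exactly one test fold."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    # Shuffle at the group level: KFold splits the array of group labels.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for _, test_pos in kf.split(unique_groups):
        test_mask = np.isin(groups, unique_groups[test_pos])
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Hypothetical data: 20 weeks with 3 samples each.
weeks = np.repeat(np.arange(20), 3)
folds = list(shuffled_group_kfold(weeks, n_splits=10, seed=0))
for train_idx, test_idx in folds[:2]:
    print(sorted(set(weeks[test_idx].tolist())))
```

Because the split is done on group labels, folds hold equal numbers of groups rather than equal numbers of samples, which matters when group sizes vary.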

What do you need to watch out for when using cross-validation with GLM lambda search?

Question: Regarding "h2o.glm lambda search not appearing to iterate over all lambdas", I read the question as complaining that lambda was too high; they tried setting early_stopping=F in the hope that might fix that "bug". Isn't it the case that the original behaviour was a feature, not a bug? And if that is correct, then you should always use early_stopping=T when using cross-validation with GLM, otherwise the error estimate from cross-validation is useless; you also risk over-fitting. (My main question

how to obtain the trained best model from a crossvalidator

I built a pipeline including a DecisionTreeClassifier (dt) like this:

    val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))

Then I used this pipeline as the estimator in a CrossValidator in order to get a model with the best set of hyperparameters, like this:

    val c_v = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction"))
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(5)

Finally, I could train a model on a training set with this cross-validator:

    val model =

Scikit-learn: scoring in GridSearchCV

Question: This question was migrated from Cross Validated because it can be answered on Stack Overflow. Migrated 2 years ago. It seems that GridSearchCV of scikit-learn collects the scores of its (inner) cross-validation folds and then averages across the scores of all folds. I was wondering about the rationale behind this. At first glance, it would seem more flexible to instead collect the predictions of its cross-validation folds and then apply the chosen scoring metric to the predictions of all
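The two strategies compared in the question can yield different numbers. A small arithmetic sketch with made-up fold results, using accuracy as the metric (an assumption; the question does not name one):

```python
# Hypothetical per-fold results: (number of correct predictions, fold size).
folds = [(4, 4), (3, 6)]

# Strategy 1: score each fold, then average the fold scores.
fold_accuracies = [correct / size for correct, size in folds]
mean_of_fold_scores = sum(fold_accuracies) / len(fold_accuracies)

# Strategy 2: pool all predictions, then score once.
pooled_accuracy = sum(c for c, _ in folds) / sum(s for _, s in folds)

print(mean_of_fold_scores, pooled_accuracy)
```

For accuracy the two values coincide whenever all folds have the same size. For metrics that are not simple per-sample averages, they can differ even with equal fold sizes.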

How to access Scikit Learn nested cross-validation scores

Question: I'm using Python and I would like to use nested cross-validation with scikit-learn. I have found a very good example:

    NUM_TRIALS = 30
    non_nested_scores = np.zeros(NUM_TRIALS)
    nested_scores = np.zeros(NUM_TRIALS)
    # Choose cross-validation techniques for the inner and outer loops,
    # independently of the dataset.
    # E.g "LabelKFold", "LeaveOneOut", "LeaveOneLabelOut", etc.
    inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)
    # Non
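A sketch of one way to get at both levels of scores, assuming a single trial instead of the `NUM_TRIALS` loop: `cross_validate` with `return_estimator=True` keeps the fitted `GridSearchCV` from each outer fold, so the inner results can be read from it. The iris dataset, the SVC model, and the parameter grid are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

# Inner loop: hyperparameter search on each outer training split.
search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1]},
                      cv=inner_cv)

# Outer loop: evaluate the tuned model; keep the fitted search objects.
result = cross_validate(search, X, y, cv=outer_cv, return_estimator=True)

outer_scores = result["test_score"]  # one score per outer fold
inner_best_scores = [est.best_score_ for est in result["estimator"]]
inner_best_params = [est.best_params_ for est in result["estimator"]]

print(outer_scores)
print(inner_best_scores)
```

Each entry of `result["estimator"]` also exposes `cv_results_`, which holds the per-candidate inner scores for that outer fold.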

How to plot a learning curve for a keras experiment?

I'm training an RNN using Keras and would like to see how the validation accuracy changes with the data set size. Keras has a list called val_acc in its history object, which is appended to after every epoch with the respective validation-set accuracy (link to the post in the Google group). I want to get the average of val_acc over the number of epochs run and plot that against the respective data set size. Question: How can I retrieve the elements in the val_acc list and perform an operation like numpy.mean(val_acc)? EDIT: As @runDOSrun said, getting the mean of the val_acc values doesn't make sense.
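A minimal sketch of retrieving the values, using plain dictionaries as stand-ins for the `history.history` attribute of the object returned by `model.fit`. All numbers are made up, and the key name `val_acc` follows the question; depending on the Keras version and the metric name passed to `compile`, it may be `val_accuracy` instead. In line with the EDIT, the sketch takes the final-epoch value rather than the mean.

```python
# Stand-ins for `history.history` from three separate training runs,
# keyed by training-set size. Values are hypothetical.
runs = {
    1000: {"val_acc": [0.52, 0.58, 0.61]},
    2000: {"val_acc": [0.55, 0.63, 0.68]},
    4000: {"val_acc": [0.60, 0.70, 0.74]},
}

# The per-epoch accuracies are an ordinary list, so indexing works directly.
final_val_acc = {size: h["val_acc"][-1] for size, h in runs.items()}

sizes = sorted(final_val_acc)
curve = [final_val_acc[s] for s in sizes]
print(sizes, curve)
```

Plotting `curve` against `sizes` then gives the learning curve over data set size.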