Cross-validation

Function for cross-validation and oversampling (SMOTE)

佐手、 submitted on 2020-01-14 05:34:07
Question: I wrote the code below. X is a DataFrame with shape (1000, 5) and y is a DataFrame with shape (1000, 1). y is the target to predict, and it is imbalanced. I want to apply cross-validation and SMOTE.

def Learning(n, est, X, y):
    s_k_fold = StratifiedKFold(n_splits=n)
    acc_scores = []
    rec_scores = []
    f1_scores = []
    for train_index, test_index in s_k_fold.split(X, y):
        X_train = X[train_index]
        y_train = y[train_index]
        sm = SMOTE(random_state=42)
        X_resampled, y_resampled = sm.fit_resample …
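A minimal sketch of one way to complete such a function, assuming the goal is to oversample only each training fold with SMOTE, fit the passed-in estimator est, and score on the untouched test fold. The metric choices and the conversion to NumPy arrays are additions, not from the question.

# Sketch: SMOTE applied only to the training fold inside stratified CV (assumptions noted above)
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, recall_score, f1_score

def Learning(n, est, X, y):
    X, y = np.asarray(X), np.asarray(y).ravel()   # accepts DataFrames as well
    s_k_fold = StratifiedKFold(n_splits=n, shuffle=True, random_state=42)
    acc_scores, rec_scores, f1_scores = [], [], []
    for train_index, test_index in s_k_fold.split(X, y):
        X_train, y_train = X[train_index], y[train_index]
        X_test, y_test = X[test_index], y[test_index]
        # Oversample the training fold only; the test fold stays untouched
        X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
        est.fit(X_res, y_res)
        pred = est.predict(X_test)
        acc_scores.append(accuracy_score(y_test, pred))
        rec_scores.append(recall_score(y_test, pred))
        f1_scores.append(f1_score(y_test, pred))
    return np.mean(acc_scores), np.mean(rec_scores), np.mean(f1_scores)

Called as, for example, Learning(5, LogisticRegression(max_iter=1000), X, y), it returns the mean accuracy, recall, and F1 across folds.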

How to use CrossValidator to choose between different models

做~自己de王妃 submitted on 2020-01-13 19:25:49
Question: I know that I can use a CrossValidator to tune a single model, but what is the suggested approach for evaluating different models against each other? For example, say I wanted to evaluate a LogisticRegression classifier against a LinearSVC classifier using CrossValidator.

Answer 1: After familiarizing myself a bit with the API, I solved this by implementing a custom Estimator that wraps two or more estimators it can delegate to, where the selected estimator is controlled by a single …
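A simpler alternative to the custom wrapper Estimator, sketched below in PySpark: run one CrossValidator per candidate model with the same evaluator and fold count, then compare the averaged metrics. The toy DataFrame, column names, and fold count are assumptions, not taken from the question.

# Sketch: comparing candidate classifiers with separate CrossValidators (assumed data and settings)
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression, LinearSVC
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.getOrCreate()
# Tiny placeholder training set with 'features' and 'label' columns
train_df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0]), 0.0), (Vectors.dense([1.0, 0.0]), 1.0)] * 20,
    ["features", "label"])

evaluator = BinaryClassificationEvaluator(labelCol="label")
candidates = [LogisticRegression(featuresCol="features", labelCol="label"),
              LinearSVC(featuresCol="features", labelCol="label")]

results = []
for est in candidates:
    cv = CrossValidator(estimator=est,
                        estimatorParamMaps=ParamGridBuilder().build(),  # empty grid: plain k-fold scoring
                        evaluator=evaluator,
                        numFolds=5,
                        seed=42)
    model = cv.fit(train_df)
    results.append((type(est).__name__, max(model.avgMetrics)))

best_name, best_metric = max(results, key=lambda t: t[1])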

Cross validation for glm() models

故事扮演 submitted on 2020-01-11 15:33:50
Question: I'm trying to do 10-fold cross-validation for some glm models I built earlier in R. I'm a little confused about the cv.glm() function in the boot package, even though I've read a lot of help files. When I call:

library(boot)
cv.glm(data, glmfit, K = 10)

does the "data" argument refer to the whole dataset or only to the test set? The examples I have seen so far pass the test set as "data", but that did not really make sense, such as why do 10 …
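For intuition: cv.glm expects the full dataset and does the splitting itself. The hypothetical Python sketch below shows the equivalent manual loop, which makes clear why handing the function only a test set would defeat the purpose; all names and the choice of model are illustrative, not from the question.

# Conceptual sketch of what K-fold CV does with the *whole* dataset (hypothetical names)
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression  # stand-in for a glm
from sklearn.metrics import mean_squared_error

def kfold_cv_error(X, y, K=10):
    errors = []
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=1).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])          # refit on K-1 folds
        errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    return np.mean(errors)   # roughly analogous to the prediction-error estimate cv.glm reports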

Cross-validation predictions from caret are assigned to different folds

你说的曾经没有我的故事 submitted on 2020-01-06 06:45:10
Question: I am wondering why predictions labelled 'Fold1' are actually predictions from the second fold in my predefined folds. Here is an example of what I mean.

# load the library
library(caret)
# load the cars dataset
data(cars)
# define folds
cv_folds <- createFolds(cars$Price, k = 5, list = TRUE, returnTrain = TRUE)
# define training control
train_control <- trainControl(method = "cv", index = cv_folds, savePredictions = 'final')
# fix the parameters of the algorithm
# train the model
model <- caret: …
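Without relying on how caret labels the saved predictions, one way to remove the ambiguity entirely is to keep your own fold-to-row mapping. A hypothetical scikit-learn sketch of that bookkeeping is below; the model and fold settings are placeholders.

# Hypothetical sketch: keep an explicit fold -> held-out-row -> prediction mapping
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def cv_predictions(X, y, k=5, seed=1):
    rows = []
    for fold, (train_idx, test_idx) in enumerate(
            KFold(n_splits=k, shuffle=True, random_state=seed).split(X), start=1):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        # record the original row index alongside the fold label, so there is
        # no doubt which fold each prediction came from
        rows.append(pd.DataFrame({"row": test_idx, "fold": fold, "pred": preds}))
    return pd.concat(rows, ignore_index=True)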

How to do cross-validation with multiple input data in a CNN model with Keras

爱⌒轻易说出口 submitted on 2020-01-05 08:26:23
Question: My dataset consists of a time series (10080 points) and other descriptive statistics features (85) joined into one row. The DataFrame is 921 x 10166. The data looks something like this, with the last 2 columns as Y (labels).

id   x0  x1     x2    x3    x4    x5   ... x10079 mean var ... Y0 Y1
1    40  31.05  25.5  25.5  25.5  25   ... 33     24   1      1  0
2    35  35.75  36.5  26.5  36.5  36.5 ... 29     31   2      0  1
3    35  35.70  36.5  36.5  36.5  36.5 ... 29     25   1      1  0
4    40  31.50  23.5  24.5  26.5  25   ... 33     29   3      0  1
...
921  40  31.05  25.5  25.5  25.5  25   ... 23     33   2      0  1

I …
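A minimal sketch of one way to cross-validate a two-input Keras model: split the row indices once, then slice both the time-series input and the statistics input with the same indices in every fold. The architecture, epochs, and input shapes below are assumptions based only on the dimensions given in the question (10080 time steps, 85 statistics, two label columns).

# Sketch: K-fold CV for a model with a time-series input and a statistics input (assumed shapes)
import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras
from tensorflow.keras import layers

def build_model():
    ts_in = keras.Input(shape=(10080, 1), name="series")
    st_in = keras.Input(shape=(85,), name="stats")
    x = layers.Conv1D(16, 7, activation="relu")(ts_in)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.concatenate([x, st_in])
    out = layers.Dense(2, activation="softmax")(x)
    model = keras.Model([ts_in, st_in], out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def cross_validate(X_series, X_stats, Y, k=5):
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X_series):
        model = build_model()  # fresh weights each fold
        model.fit([X_series[train_idx], X_stats[train_idx]], Y[train_idx],
                  epochs=10, batch_size=32, verbose=0)
        scores.append(model.evaluate([X_series[test_idx], X_stats[test_idx]],
                                     Y[test_idx], verbose=0)[1])
    return np.mean(scores)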

Creating data partitions over a selected range of data to be fed into the caret::train function for cross-validation

非 Y 不嫁゛ submitted on 2020-01-04 14:15:07
Question: I want to create jack-knife data partitions for the data frame below, with the partitions to be used in caret::train (like what caret::groupKFold() produces). The catch is that I want to restrict the test points to, say, Time greater than 16 days, while using the remainder of the data as the training set.

df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05), Time = seq(1:20))

The reason I want to do this is that I am only really interested in how well the model is predicting the …
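A hypothetical Python sketch of the same idea: restricted jack-knife splits where each test set is a single point with Time > 16 and the training set is everything else. In caret the analogous move would be building the index/indexOut lists by hand; the data frame below mirrors the one in the question, but the threshold and function name are assumptions.

# Sketch: jack-knife splits where only points with Time > 16 are ever held out (assumed threshold)
import numpy as np
import pandas as pd

df = pd.DataFrame({"Effect": np.arange(0.05, 1.0001, 0.05), "Time": np.arange(1, 21)})

def restricted_jackknife(df, threshold=16):
    eligible = np.flatnonzero(df["Time"] > threshold)
    for test_idx in eligible:
        train_idx = np.setdiff1d(np.arange(len(df)), [test_idx])
        yield train_idx, np.array([test_idx])

# Each split leaves out exactly one late-time observation
for train_idx, test_idx in restricted_jackknife(df):
    print(df.loc[test_idx, "Time"].tolist(), "held out;", len(train_idx), "training rows")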

How to perform SMOTE with cross validation in sklearn in python

孤人 submitted on 2020-01-04 08:03:25
Question: I have a highly imbalanced dataset and would like to perform SMOTE to balance it and cross-validation to measure the accuracy. However, most existing tutorials apply SMOTE with only a single train/test split. Therefore, I would like to know the correct procedure for performing SMOTE with cross-validation. My current code is as follows, but as mentioned above it only uses a single split.

from imblearn.over_sampling import SMOTE
from sklearn …
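A minimal sketch of the usual fix: put SMOTE and the classifier into an imbalanced-learn Pipeline and hand the pipeline to cross_val_score, so oversampling is applied inside each training fold only. The synthetic dataset, classifier choice, and scoring metric are assumptions for illustration.

# Sketch: SMOTE inside an imblearn Pipeline so resampling stays within each training fold
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder imbalanced data (roughly 90/10 class split)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")
print(scores.mean())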

Cross-validation in Stanford NER

爱⌒轻易说出口 submitted on 2020-01-04 05:56:50
Question: I'm trying to use cross-validation in Stanford NER. The feature factory lists three properties:

numFolds        int  1  The number of folds to use for cross-validation.
startFold       int  1  The starting fold to run.
numFoldsToRun   int  1  The number of folds to run.

which I think should be used for cross-validation, but I don't think they actually work. Setting numFolds to 1 or 10 doesn't change the training time at all, and strangely, using numFoldsToRun gives the following warning: Unknown property: …
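If those properties turn out to have no effect in a given CRFClassifier build, one workaround is to run the cross-validation loop yourself: split the CoNLL-style training file into k folds and train/evaluate each fold externally via trainFile/testFile. The sketch below only does the file splitting; the file names, the blank-line sentence delimiter, and k are assumptions.

# Sketch: split a CoNLL-style training file into k folds for manual cross-validation
# (hypothetical file names; assumes sentences are separated by blank lines)
import random

def split_folds(path, k=10, seed=42):
    with open(path, encoding="utf-8") as f:
        sentences = [s for s in f.read().split("\n\n") if s.strip()]
    random.Random(seed).shuffle(sentences)
    folds = [sentences[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        with open(f"train.fold{i}.tsv", "w", encoding="utf-8") as out:
            out.write("\n\n".join(train) + "\n")
        with open(f"test.fold{i}.tsv", "w", encoding="utf-8") as out:
            out.write("\n\n".join(held_out) + "\n")

split_folds("ner.train.tsv", k=10)
# Each train/test pair can then be passed to the CRFClassifier as trainFile / testFile.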

How to do repeatable sampling in BigQuery Standard SQL?

感情迁移 submitted on 2020-01-04 03:54:07
Question: In this blog a Google Cloud employee explains how to do repeatable sampling of data sets for machine learning in BigQuery. This is very important for creating (and replicating) train/validation/test partitions of your data. However, the blog uses Legacy SQL, which Google has now deprecated in favor of Standard SQL. How would you rewrite the blog's sampling code shown below using Standard SQL?

#legacySQL
SELECT
  date,
  airline,
  departure_airport,
  departure_schedule,
  arrival_airport,
  arrival …
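The usual Standard SQL replacement for Legacy SQL's HASH() is FARM_FINGERPRINT(). A hedged sketch is shown below, run from Python with the BigQuery client; the table name and the 80% threshold are placeholders, since the original query's FROM clause is cut off above.

# Sketch: repeatable ~80% sample in Standard SQL via FARM_FINGERPRINT (placeholder table name)
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  date, airline, departure_airport, departure_schedule,
  arrival_airport, arrival_schedule
FROM
  `project.dataset.flights`          -- placeholder: the blog's table is not shown above
WHERE
  MOD(ABS(FARM_FINGERPRINT(CAST(date AS STRING))), 10) < 8   -- same rows selected on every run
"""

train_df = client.query(sql).to_dataframe()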

Stratified cross-validation of time series data

血红的双手。 submitted on 2020-01-04 02:34:06
Question: I want to do time series cross-validation based on group (the grp column). In the sample data below, temperature is my target variable.

import numpy as np
import pandas as pd

timeS = pd.date_range(start='1980-01-01 00:00:00', end='1980-01-01 00:00:05', freq='S')
df = pd.DataFrame(dict(time=timeS, grp=['A']*3 + ['B']*3, material=[1,2,3]*2,
                       temperature=['2.4','5','9.9']*2))

   grp  material temperature                time
0    A         1         2.4 1980-01-01 00:00:00
1    A         2           5 1980-01-01 00:00:01
2    A         3         9.9 1980-01-01 00:00:02
3    B         1         2.4 …
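One way to approach this, sketched below: apply scikit-learn's TimeSeriesSplit within each group separately, so that training rows always precede test rows inside the same grp. It reuses the df built in the question's code; the split count is an assumption and is tiny only because the sample data has three rows per group.

# Sketch: time-ordered CV splits computed per group (grp), using positional row indices
from sklearn.model_selection import TimeSeriesSplit

def grouped_time_series_splits(df, group_col="grp", time_col="time", n_splits=2):
    for grp, g in df.groupby(group_col):
        g = g.sort_values(time_col)
        positions = df.index.get_indexer(g.index)   # positions of this group's rows in df
        for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(positions):
            yield grp, positions[train_idx], positions[test_idx]

# Each yielded pair keeps earlier timestamps in train and later ones in test
for grp, tr, te in grouped_time_series_splits(df):
    print(grp, "train rows:", tr, "test rows:", te)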