cross-validation

How to compute accuracy and the confusion matrix using K-fold cross-validation?

戏子无情 submitted on 2020-01-04 02:03:34
Question: I tried to do K-fold cross-validation with K=30 folds, with one confusion matrix for each fold. How do I compute the accuracy and the confusion matrix for the model, with a confidence interval? Could someone help me? My code is:

import numpy as np
from sklearn import model_selection
from sklearn import datasets
from sklearn import svm
import pandas as pd
from sklearn.linear_model import LogisticRegression

UNSW = pd.read_csv('/home/sec/Desktop/CEFET/tudao.csv')
previsores = UNSW.iloc[:,UNSW.columns
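One common way to approach this (a sketch, not the asker's code: the iris data and the normal-approximation interval below are illustrative) is to loop over StratifiedKFold splits, keep one confusion matrix per fold, and summarize the fold accuracies with a confidence interval:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)          # stand-in for the asker's features/labels
skf = StratifiedKFold(n_splits=30, shuffle=True, random_state=0)

accs, cms = [], []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    accs.append(accuracy_score(y[test_idx], pred))
    # fixing labels keeps every per-fold matrix the same shape, so they can be summed later
    cms.append(confusion_matrix(y[test_idx], pred, labels=np.unique(y)))

accs = np.array(accs)
mean, se = accs.mean(), accs.std(ddof=1) / np.sqrt(len(accs))
print("accuracy: %.3f +/- %.3f (approx. 95%% CI)" % (mean, 1.96 * se))
print("summed confusion matrix:\n", np.sum(cms, axis=0))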

Tuning xgboost with xgb.train providing a validation set in R

南楼画角 submitted on 2020-01-04 01:26:29
Question: Related questions here and here. The common way of tuning xgboost (i.e. nrounds) is using xgb.cv, which performs k-fold cross-validation, for example:

require(xgboost)
data(iris)
set.seed(1)
index = sample(1:150)
X = as.matrix(iris[index, 1:4])
y = as.matrix(as.numeric(iris[index, "Species"])) - 1
param = list(eta=0.1, objective="multi:softprob")
xgb.cv(params=param, data=X, nrounds=50, nfold=5, label=y, num_class=3)

> train.merror.mean train.merror.std test.merror.mean test.merror.std
> 1: 0
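The question is about R's xgb.train; purely as a reference point, a minimal Python-API sketch of the same pattern, passing an explicit validation set through evals and letting early stopping pick the number of rounds (the split size and parameters here are illustrative):

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

params = {"eta": 0.1, "objective": "multi:softprob", "num_class": 3}
# evals is evaluated every round; early_stopping_rounds stops when the validation
# metric has not improved for 10 consecutive rounds
booster = xgb.train(params, dtrain, num_boost_round=200,
                    evals=[(dtrain, "train"), (dval, "val")],
                    early_stopping_rounds=10, verbose_eval=False)
print("best iteration:", booster.best_iteration)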

How to perform GridSearchCV with cross validation in python

只谈情不闲聊 submitted on 2020-01-03 11:48:01
Question: I am performing hyperparameter tuning of a RandomForest as follows using GridSearchCV.

X = np.array(df[features]) # all features
y = np.array(df['gold_standard']) # labels
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(x_train, y
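A self-contained sketch of this workflow (the dataset and grid values below are placeholders, not the asker's data): GridSearchCV runs the 5-fold cross-validation internally on the training split, and the held-out test split is scored only once at the end:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [4, 6, 8],
    "criterion": ["gini", "entropy"],
}
rfc = RandomForestClassifier(random_state=42)
search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5, n_jobs=-1)
search.fit(x_train, y_train)                         # 5-fold CV happens inside this call

print("best params:", search.best_params_)
print("mean CV score:", search.best_score_)
print("test score:", search.score(x_test, y_test))   # scored once, outside the CV loop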

How to rank the instances based on prediction probability in sklearn

时光怂恿深爱的人放手 submitted on 2020-01-03 01:51:08
Question: I am using sklearn's support vector machine (SVC) as follows to get the prediction probability of the instances in my dataset, using 10-fold cross-validation.

from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = SVC(class_weight="balanced")
proba = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
print(clf.classes_)
print(proba[:,1])
print(np.argsort(proba[:,1]))

My expected output is as follows for print(proba[:,1]) and print(np
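A runnable sketch of the ranking step, under two assumptions not shown in the snippet: probability=True is set on the SVC so predict_proba is available, and the instances are ranked by the probability of one class in descending order (the class index used is illustrative):

import numpy as np
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = SVC(class_weight="balanced", probability=True, random_state=0)
proba = cross_val_predict(clf, X, y, cv=10, method="predict_proba")

class_idx = 1                                    # rank by P(class 1); columns follow sorted class labels
ranking = np.argsort(proba[:, class_idx])[::-1]  # instance indices, highest probability first
print(ranking[:10])                              # top-10 instances for that class
print(proba[ranking[:10], class_idx])            # their probabilities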

caret: combine createResample and groupKFold

人盡茶涼 submitted on 2020-01-02 10:15:36
Question: I want to do a custom sampling with caret. My specifications are the following: I have 1 observation per day, and my grouping factor is the month (12 values); so in the first step I create 12 resamples with 11 months in the training set (11*30 points) and 1 month in the testing set (30 points). This way I get 12 resamples in total. But that's not enough for me, and I would like to make it a little more complex by adding some bootstrapping of the training points of each partition. So, instead of having 11
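The question itself is about caret in R; as a rough illustration of the same idea in Python (placeholder data: one observation per day, grouped into 12 months of 30 days), a leave-one-month-out split whose training indices are then bootstrapped:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n_days = 360
X = rng.normal(size=(n_days, 3))            # placeholder features, one row per day
months = np.repeat(np.arange(12), 30)       # grouping factor: 12 months of 30 days each

logo = LeaveOneGroupOut()                    # 12 resamples: train on 11 months, test on 1
resamples = []
for train_idx, test_idx in logo.split(X, groups=months):
    # bootstrap the training points of this partition (sample with replacement, same size)
    boot_idx = rng.choice(train_idx, size=len(train_idx), replace=True)
    resamples.append((boot_idx, test_idx))

print(len(resamples), "resamples;",
      "train size:", len(resamples[0][0]), "test size:", len(resamples[0][1]))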

Do I use the same Tfidf vocabulary in k-fold cross_validation

倖福魔咒の submitted on 2020-01-02 02:04:33
Question: I am doing text classification based on the TF-IDF vector space model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier using 5-fold cross-validation. But what confuses me is whether it is necessary to rebuild the TF-IDF vector space model in each fold of the cross-validation. Namely, would I need to rebuild the vocabulary and recalculate the IDF values of the vocabulary in each fold? Currently I'm doing the TF-IDF transforming based on scikit-learn
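The usual way to avoid leakage here with scikit-learn is to put the vectorizer inside a Pipeline, so the vocabulary and IDF values are refit on the training portion of every fold. A minimal sketch (the 20 newsgroups subset is only placeholder data):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
texts, labels = data.data, data.target

# The pipeline refits TfidfVectorizer (vocabulary + IDF) on the training split of each fold,
# so no information from the held-out fold leaks into the features.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, texts, labels, cv=5)
print(scores.mean(), scores.std())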

Reproducible splitting of data into training and testing in R

本小妞迷上赌 submitted on 2020-01-01 22:15:12
Question: A common way of sampling/splitting data in R is using sample, e.g., on row numbers. For example:

require(data.table)
set.seed(1)
population <- as.character(1e5:(1e6-1)) # some made up ID names
N <- 1e4 # sample size
sample1 <- data.table(id = sort(sample(population, N))) # randomly sample N ids
test <- sample(N-1, N/2, replace = F)
test1 <- sample1[test, .(id)]

The problem is that this isn't very robust to changes in the data. For example, if we drop just one observation:

sample2 <- sample1[
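The question is about R; one widely used way to make a split robust to row changes is to decide membership from a hash of the ID itself, so adding or dropping rows never reshuffles the rest. A small Python sketch of that idea (the ID range is shortened for speed):

import hashlib

def in_test_set(identifier: str, test_fraction: float = 0.5) -> bool:
    """Deterministically assign an ID to the test set based only on its own hash."""
    h = int(hashlib.md5(identifier.encode("utf-8")).hexdigest(), 16)
    return (h % 10_000) / 10_000 < test_fraction

# a slice of made-up ID names, in the spirit of the question's population
population = [str(i) for i in range(100_000, 110_000)]
test_ids = [pid for pid in population if in_test_set(pid)]
train_ids = [pid for pid in population if not in_test_set(pid)]
print(len(test_ids), len(train_ids))
# Dropping or adding rows leaves every other ID's assignment unchanged,
# because membership depends only on the ID itself, not on row positions.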

Why does calling the KFold generator with shuffle give the same indices?

折月煮酒 submitted on 2020-01-01 16:51:12
Question: With sklearn, when you create a new KFold object and shuffle is true, it'll produce different, newly randomized fold indices. However, every generator from a given KFold object gives the same indices for each fold, even when shuffle is true. Why does it work like this? Example:

from sklearn.cross_validation import KFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(4, n_folds=2, shuffle = True)

for fold in kf:
    print fold

print '---second round----'
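For context, a small sketch with the current sklearn.model_selection API (the snippet above uses the long-removed sklearn.cross_validation module): with a fixed random_state the folds produced by one KFold object are the same on every call, and a different random_state (or a new object) is what changes them:

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

kf_a = KFold(n_splits=2, shuffle=True, random_state=0)
print([(list(tr), list(te)) for tr, te in kf_a.split(X)])  # same folds on every call for this seed
print([(list(tr), list(te)) for tr, te in kf_a.split(X)])

kf_b = KFold(n_splits=2, shuffle=True, random_state=1)     # different seed -> different folds
print([(list(tr), list(te)) for tr, te in kf_b.split(X)])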