cross-validation

How to compute accuracy and the confusion matrix using K-fold cross-validation?

戏子无情 submitted on 2020-01-04 02:03:34
Question: I tried to do K-fold cross-validation with K=30 folds, with one confusion matrix for each fold. How do I compute the accuracy and the confusion matrix for the model, with a confidence interval? Could someone help me? My code is:

import numpy as np
from sklearn import model_selection
from sklearn import datasets
from sklearn import svm
import pandas as pd
from sklearn.linear_model import LogisticRegression

UNSW = pd.read_csv('/home/sec/Desktop/CEFET/tudao.csv')
previsores = UNSW.iloc[:,UNSW.columns
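One common way to approach this (a sketch, not the asker's code: the iris data and the normal-approximation interval below are illustrative) is to loop over StratifiedKFold splits, keep one confusion matrix per fold, and summarize the fold accuracies with a confidence interval:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)          # stand-in for the asker's features/labels
skf = StratifiedKFold(n_splits=30, shuffle=True, random_state=0)

accs, cms = [], []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    accs.append(accuracy_score(y[test_idx], pred))
    # fixing labels keeps every per-fold matrix the same shape, so they can be summed later
    cms.append(confusion_matrix(y[test_idx], pred, labels=np.unique(y)))

accs = np.array(accs)
mean, se = accs.mean(), accs.std(ddof=1) / np.sqrt(len(accs))
print("accuracy: %.3f +/- %.3f (approx. 95%% CI)" % (mean, 1.96 * se))
print("summed confusion matrix:\n", np.sum(cms, axis=0))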

Tuning xgboost with xgb.train providing a validation set in R

南楼画角 submitted on 2020-01-04 01:26:29
Question: Related questions here and here. The common way of tuning xgboost (i.e. nrounds) is using xgb.cv, which performs k-fold cross-validation, for example:

require(xgboost)
data(iris)
set.seed(1)
index = sample(1:150)
X = as.matrix(iris[index, 1:4])
y = as.matrix(as.numeric(iris[index, "Species"])) - 1
param = list(eta=0.1, objective="multi:softprob")
xgb.cv(params=param, data=X, nrounds=50, nfold=5, label=y, num_class=3)

> train.merror.mean train.merror.std test.merror.mean test.merror.std
> 1: 0
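The question is about R's xgb.train; purely as a reference point, a minimal Python-API sketch of the same pattern, passing an explicit validation set through evals and letting early stopping pick the number of rounds (the split size and parameters here are illustrative):

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

params = {"eta": 0.1, "objective": "multi:softprob", "num_class": 3}
# evals is evaluated every round; early_stopping_rounds stops when the validation
# metric has not improved for 10 consecutive rounds
booster = xgb.train(params, dtrain, num_boost_round=200,
                    evals=[(dtrain, "train"), (dval, "val")],
                    early_stopping_rounds=10, verbose_eval=False)
print("best iteration:", booster.best_iteration)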

How to perform GridSearchCV with cross validation in python

只谈情不闲聊 submitted on 2020-01-03 11:48:01
Question: I am performing hyperparameter tuning of a RandomForest as follows using GridSearchCV.

X = np.array(df[features]) # all features
y = np.array(df['gold_standard']) # labels
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(x_train, y
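A self-contained sketch of this workflow (the dataset and grid values below are placeholders, not the asker's data): GridSearchCV runs the 5-fold cross-validation internally on the training split, and the held-out test split is scored only once at the end:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [4, 6, 8],
    "criterion": ["gini", "entropy"],
}
rfc = RandomForestClassifier(random_state=42)
search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5, n_jobs=-1)
search.fit(x_train, y_train)                         # 5-fold CV happens inside this call

print("best params:", search.best_params_)
print("mean CV score:", search.best_score_)
print("test score:", search.score(x_test, y_test))   # scored once, outside the CV loop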

How to rank the instances based on prediction probability in sklearn

时光怂恿深爱的人放手 submitted on 2020-01-03 01:51:08
Question: I am using sklearn's support vector machine (SVC) as follows to get the prediction probability of the instances in my dataset, using 10-fold cross-validation.

from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = SVC(class_weight="balanced")
proba = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
print(clf.classes_)
print(proba[:,1])
print(np.argsort(proba[:,1]))

My expected output is as follows for print(proba[:,1]) and print(np
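A runnable sketch of the ranking step, under two assumptions not shown in the snippet: probability=True is set on the SVC so predict_proba is available, and the instances are ranked by the probability of one class in descending order (the class index used is illustrative):

import numpy as np
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = SVC(class_weight="balanced", probability=True, random_state=0)
proba = cross_val_predict(clf, X, y, cv=10, method="predict_proba")

class_idx = 1                                    # rank by P(class 1); columns follow sorted class labels
ranking = np.argsort(proba[:, class_idx])[::-1]  # instance indices, highest probability first
print(ranking[:10])                              # top-10 instances for that class
print(proba[ranking[:10], class_idx])            # their probabilities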

caret: combine createResample and groupKFold

人盡茶涼 submitted on 2020-01-02 10:15:36
Question: I want to do a custom sampling with caret. My specifications are the following: I have 1 observation per day, and my grouping factor is the month (12 values); so in the first step I create 12 resamples with 11 months in the training set (11*30 points) and 1 month in the testing set (30 points). This way I get 12 resamples in total. But that's not enough for me, and I would like to make it a little more complex by adding some bootstrapping of the training points of each partition. So, instead of having 11
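The question itself is about caret in R; as a rough illustration of the same idea in Python (placeholder data: one observation per day, grouped into 12 months of 30 days), a leave-one-month-out split whose training indices are then bootstrapped:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n_days = 360
X = rng.normal(size=(n_days, 3))            # placeholder features, one row per day
months = np.repeat(np.arange(12), 30)       # grouping factor: 12 months of 30 days each

logo = LeaveOneGroupOut()                    # 12 resamples: train on 11 months, test on 1
resamples = []
for train_idx, test_idx in logo.split(X, groups=months):
    # bootstrap the training points of this partition (sample with replacement, same size)
    boot_idx = rng.choice(train_idx, size=len(train_idx), replace=True)
    resamples.append((boot_idx, test_idx))

print(len(resamples), "resamples;",
      "train size:", len(resamples[0][0]), "test size:", len(resamples[0][1]))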

Do I use the same Tfidf vocabulary in k-fold cross_validation

倖福魔咒の submitted on 2020-01-02 02:04:33
Question: I am doing text classification based on the TF-IDF vector space model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier using 5-fold cross-validation. But what confuses me is whether it is necessary to rebuild the TF-IDF vector space model in each fold of the cross-validation. Namely, would I need to rebuild the vocabulary and recalculate the IDF values of the vocabulary in each fold? Currently I'm doing the TF-IDF transforming based on scikit-learn
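The usual way to avoid leakage here with scikit-learn is to put the vectorizer inside a Pipeline, so the vocabulary and IDF values are refit on the training portion of every fold. A minimal sketch (the 20 newsgroups subset is only placeholder data):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
texts, labels = data.data, data.target

# The pipeline refits TfidfVectorizer (vocabulary + IDF) on the training split of each fold,
# so no information from the held-out fold leaks into the features.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, texts, labels, cv=5)
print(scores.mean(), scores.std())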

Reproducible splitting of data into training and testing in R

本小妞迷上赌 submitted on 2020-01-01 22:15:12
Question: A common way of sampling/splitting data in R is using sample, e.g., on row numbers. For example:

require(data.table)
set.seed(1)
population <- as.character(1e5:(1e6-1)) # some made up ID names
N <- 1e4 # sample size
sample1 <- data.table(id = sort(sample(population, N))) # randomly sample N ids
test <- sample(N-1, N/2, replace = F)
test1 <- sample1[test, .(id)]

The problem is that this isn't very robust to changes in the data. For example, if we drop just one observation:

sample2 <- sample1[
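The question is about R; one widely used way to make a split robust to row changes is to decide membership from a hash of the ID itself, so adding or dropping rows never reshuffles the rest. A small Python sketch of that idea (the ID range is shortened for speed):

import hashlib

def in_test_set(identifier: str, test_fraction: float = 0.5) -> bool:
    """Deterministically assign an ID to the test set based only on its own hash."""
    h = int(hashlib.md5(identifier.encode("utf-8")).hexdigest(), 16)
    return (h % 10_000) / 10_000 < test_fraction

# a slice of made-up ID names, in the spirit of the question's population
population = [str(i) for i in range(100_000, 110_000)]
test_ids = [pid for pid in population if in_test_set(pid)]
train_ids = [pid for pid in population if not in_test_set(pid)]
print(len(test_ids), len(train_ids))
# Dropping or adding rows leaves every other ID's assignment unchanged,
# because membership depends only on the ID itself, not on row positions.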

Why does calling the KFold generator with shuffle give the same indices?

折月煮酒 submitted on 2020-01-01 16:51:12
Question: With sklearn, when you create a new KFold object and shuffle is true, it'll produce different, newly randomized fold indices. However, every generator from a given KFold object gives the same indices for each fold, even when shuffle is true. Why does it work like this? Example:

from sklearn.cross_validation import KFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(4, n_folds=2, shuffle = True)

for fold in kf:
    print fold

print '---second round----'
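For context, a small sketch with the current sklearn.model_selection API (the snippet above uses the long-removed sklearn.cross_validation module): with a fixed random_state the folds produced by one KFold object are the same on every call, and a different random_state (or a new object) is what changes them:

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

kf_a = KFold(n_splits=2, shuffle=True, random_state=0)
print([(list(tr), list(te)) for tr, te in kf_a.split(X)])  # same folds on every call for this seed
print([(list(tr), list(te)) for tr, te in kf_a.split(X)])

kf_b = KFold(n_splits=2, shuffle=True, random_state=1)     # different seed -> different folds
print([(list(tr), list(te)) for tr, te in kf_b.split(X)])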