cross-validation

H2O - balance classes - cross validation

岁酱吖の submitted on 2019-12-10 11:34:18
Question: I would like to build a GBM model with H2O. My data set is imbalanced, so I am using the balance_classes parameter. For grid search (parameter tuning) I would like to use 5-fold cross-validation. I am wondering how H2O deals with class balancing in that case. Will only the training folds be rebalanced? I want to be sure the test fold is not rebalanced. Thank you.

Answer 1: In class imbalance settings, artificially balancing the test/validation set does not make any sense: these sets must remain
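A minimal sketch of the setup being asked about, using H2O's Python API; the file path and the "label" column name are placeholders, and the comment reflects the behavior the question is hoping for (rebalancing applied to the data the fold models are trained on, not to the holdout fold):

    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    h2o.init()
    # "train.csv" and the "label" column are hypothetical placeholders.
    train = h2o.import_file("train.csv")
    train["label"] = train["label"].asfactor()
    features = [c for c in train.columns if c != "label"]

    # balance_classes rebalances the data each model is trained on;
    # cross-validation holdout predictions are made on the original rows.
    gbm = H2OGradientBoostingEstimator(balance_classes=True, nfolds=5, seed=42)
    gbm.train(x=features, y="label", training_frame=train)
    print(gbm.cross_validation_metrics_summary())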

Scikit F-score metric error

做~自己de王妃 submitted on 2019-12-10 03:54:15
Question: I am trying to predict a set of labels using Logistic Regression from SciKit. My data is really imbalanced (there are many more '0' than '1' labels), so I have to use the F1 score metric during the cross-validation step to "balance" the result. [Input]

    X_training, y_training, X_test, y_test = generate_datasets(df_X, df_y, 0.6)
    logistic = LogisticRegressionCV(Cs=50, cv=4, penalty='l2', fit_intercept=True, scoring='f1')
    logistic.fit(X_training, y_training)
    print('Predicted: %s' % str(logistic
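For reference, a self-contained, runnable version of the same pattern; the synthetic imbalanced data and the train_test_split call stand in for the question's generate_datasets helper:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Synthetic imbalanced data (roughly 90% '0' labels).
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=0)

    # scoring='f1' makes the internal 4-fold CV pick the C that maximizes F1,
    # which is more informative than accuracy on imbalanced labels.
    logistic = LogisticRegressionCV(Cs=50, cv=4, penalty='l2', fit_intercept=True, scoring='f1')
    logistic.fit(X_tr, y_tr)
    print('Test F1: %.3f' % f1_score(y_te, logistic.predict(X_te)))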

Sklearn StratifiedKFold: ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-09 14:34:57
Question: I am working with Sklearn's stratified k-fold split, and when I attempt to split using multi-class labels, I receive an error (see below). When I split using binary labels, it works with no problem.

    num_classes = len(np.unique(y_train))
    y_train_categorical = keras.utils.to_categorical(y_train, num_classes)
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)

    # splitting data into different folds
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical)):
        x_train_kf, x_val
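The error arises because kf.split is given the one-hot matrix, which sklearn classifies as a 'multilabel-indicator' target. A sketch of the usual fix (x_train and y_train here are synthetic placeholders): stratify on the integer labels and one-hot encode inside the loop:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from tensorflow import keras  # assumption: keras is used via tensorflow

    # Placeholder data: features plus *integer* class labels (3 classes).
    x_train = np.random.rand(100, 8)
    y_train = np.random.randint(0, 3, size=100)

    num_classes = len(np.unique(y_train))
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)

    # Pass the integer labels to split(); StratifiedKFold rejects one-hot
    # targets. One-hot encode per fold instead.
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):
        x_tr, x_val = x_train[train_index], x_train[val_index]
        y_tr = keras.utils.to_categorical(y_train[train_index], num_classes)
        y_val = keras.utils.to_categorical(y_train[val_index], num_classes)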

Is cv.glmnet overfitting the data by using the full lambda sequence?

江枫思渺然 submitted on 2019-12-09 13:51:30
Question: cv.glmnet is used in most research papers and companies. While building a similar function to cv.glmnet for glmnet.cr (a package that implements the lasso for continuation-ratio ordinal regression), I came across this problem in cv.glmnet. cv.glmnet first fits the model:

    glmnet.object = glmnet(x, y, weights = weights, offset = offset, lambda = lambda, ...)

After the glmnet object is created with the complete data, the next step goes as follows: The lambda from the complete
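To make the two-step design concrete, here is a hedged Python sketch of the same pattern (the lasso lambda_max formula is standard up to centering; the grid spacing and dataset are illustrative). The point is that the complete data contributes only the grid of candidate penalties, while every scored model is refit from scratch on K-1 folds:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)

    # Step 1 (done on the *complete* data, as in cv.glmnet): derive only the
    # lambda grid. For the lasso, lambda_max = max|X'y| / n (up to centering)
    # is the smallest penalty that zeroes out all coefficients.
    lam_max = np.max(np.abs(X.T @ y)) / len(y)
    lambdas = np.logspace(np.log10(lam_max), np.log10(lam_max * 1e-3), 50)

    # Step 2: each candidate penalty is evaluated by refitting on K-1 folds,
    # so the full-data fit contributes nothing but the grid itself.
    cv_mse = np.zeros(len(lambdas))
    for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        for j, lam in enumerate(lambdas):
            model = Lasso(alpha=lam, max_iter=10000).fit(X[train], y[train])
            cv_mse[j] += np.mean((y[test] - model.predict(X[test])) ** 2) / 10
    print("best lambda:", lambdas[np.argmin(cv_mse)])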

Scikit - Combining scale and grid search

走远了吗. submitted on 2019-12-09 13:34:42
Question: I am new to scikit and have two small issues combining data scaling and grid search. Efficient scaler: considering cross-validation with K folds, I would like that each time we train the model on the K-1 folds, the data scaler (using preprocessing.StandardScaler() for instance) is fit only on the K-1 folds and then applied to the remaining fold. My impression is that the following code will fit the scaler on the entire dataset, and therefore I would like to modify it to behave as described
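The standard remedy, sketched below with an arbitrary dataset and estimator: wrap the scaler and the model in a Pipeline, so that GridSearchCV refits the scaler inside every CV split rather than on the whole dataset:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    # Inside a Pipeline, StandardScaler is fit on the K-1 training folds only
    # and merely applied (transform, not fit) to the held-out fold.
    pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
    grid = GridSearchCV(pipe, param_grid={'svc__C': [0.1, 1, 10]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)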

Best parameters found by Hyperopt are unsuitable

北城余情 submitted on 2019-12-09 11:43:02
Question: I used hyperopt to search for the best parameters for an SVM classifier, but Hyperopt says the best 'kernel' is '0'. {'kernel': '0'} is obviously unsuitable. Does anyone know whether it's caused by my fault or a bug in hyperopt? Code is below.

    from hyperopt import fmin, tpe, hp, rand
    import numpy as np
    from sklearn.metrics import accuracy_score
    from sklearn import svm
    from sklearn.cross_validation import StratifiedKFold

    parameter_space_svc = {
        'C': hp.loguniform("C", np.log(1), np.log(100)),
        'kernel': hp
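This is not a bug: hp.choice reports the index of the selected option, not the option itself. A minimal sketch of the usual fix using hyperopt's space_eval (the dummy objective below stands in for the question's cross-validation loss):

    import numpy as np
    from hyperopt import fmin, hp, space_eval, tpe

    space = {
        'C': hp.loguniform('C', np.log(1), np.log(100)),
        'kernel': hp.choice('kernel', ['linear', 'rbf']),
    }

    # fmin returns hp.choice values as indices into the option list.
    best = fmin(fn=lambda params: 0.0,  # dummy objective (placeholder for CV loss)
                space=space, algo=tpe.suggest, max_evals=10)
    print(best)                     # e.g. {'C': 3.2, 'kernel': 0}
    print(space_eval(space, best))  # e.g. {'C': 3.2, 'kernel': 'linear'}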

cost function in cv.glm of boot library in R

这一生的挚爱 submitted on 2019-12-09 04:55:28
Question: I am trying to use the cross-validation function cv.glm from the boot library in R to determine the number of misclassifications when a GLM logistic regression is applied. The function has the following signature:

    cv.glm(data, glmfit, cost, K)

with the first two denoting the data and model, and K specifying the number of folds. My problem is the cost parameter, which is defined as: cost: A function of two vector arguments specifying the cost function for the cross-validation. The first argument to cost
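cv.glm is an R function, but the cost it expects is easy to state in Python (the language of the other examples on this page): the first argument is the vector of observed responses, the second the cross-validated fitted values. For a logistic model, the commonly used misclassification cost thresholds the absolute difference at 0.5; a hedged sketch:

    import numpy as np

    # Python rendering of the misclassification cost typically passed to
    # cv.glm for a logistic model: observed 0/1 responses vs. predicted
    # probabilities, returning the error rate under a 0.5 threshold.
    def misclassification_cost(y_observed, prob_predicted):
        y = np.asarray(y_observed)
        p = np.asarray(prob_predicted)
        return np.mean(np.abs(y - p) > 0.5)

    print(misclassification_cost([0, 1, 1, 0], [0.2, 0.9, 0.4, 0.6]))  # 0.5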

NaNs suddenly appearing for sklearn KFolds

妖精的绣舞 submitted on 2019-12-08 19:40:35
I'm trying to run cross-validation on my data set. The data appears to be clean, but when I try to run it, some of my data gets replaced by NaNs. I'm not sure why. Has anybody seen this before?

    y, X = np.ravel(df_test['labels']), df_test[['variation', 'length', 'tempo']]
    X_train, X_test, y_train, y_test = cv.train_test_split(X, y, test_size=.30, random_state=4444)

This is what my X data looked like before KFolds:

       variation       length       tempo
    0   0.005144  1183.148118  135.999178
    1   0.002595   720.165442  117.453835
    2   0.008146   397.500952  112.347147
    3   0.005367  1109.819501  172.265625
    4   0.001631   509.931973
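A common cause, offered here as a hedged guess since the full traceback isn't shown: train_test_split shuffles the DataFrame index, and label-based lookup on the shuffled frame with the positional indices that KFold yields can silently produce NaN rows. A sketch of the safe .iloc pattern on toy data:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold, train_test_split

    # Toy frame standing in for df_test; values are not the question's data.
    df = pd.DataFrame({'variation': np.random.rand(10),
                       'length': np.random.rand(10),
                       'tempo': np.random.rand(10),
                       'labels': np.random.randint(0, 2, 10)})
    X, y = df[['variation', 'length', 'tempo']], df['labels'].values

    # After this, X_train keeps a scrambled DataFrame index.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=4444)

    # KFold yields *positional* indices, so index with .iloc, never .loc,
    # to avoid missing-label lookups turning into NaN rows.
    for train_idx, val_idx in KFold(n_splits=5).split(X_train):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]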

How to rank the instances based on prediction probability in sklearn

与世无争的帅哥 submitted on 2019-12-08 14:00:40
I am using sklearn's support vector machine (SVC) as follows to get the prediction probability of the instances in my dataset, using 10-fold cross-validation.

    from sklearn import datasets
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    clf = SVC(class_weight="balanced")
    proba = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
    print(clf.classes_)
    print(proba[:,1])
    print(np.argsort(proba[:,1]))

My expected output is as follows for print(proba[:,1]) and print(np.argsort(proba[:,1])), where the first one indicates the prediction probability of all instances for class
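Two details keep the quoted snippet from running as-is: SVC needs probability=True before predict_proba is available, and clf.classes_ is only set after fitting (cross_val_predict fits clones, not clf itself). A hedged, runnable sketch of the ranking:

    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import cross_val_predict
    from sklearn.svm import SVC

    iris = datasets.load_iris()
    X, y = iris.data, iris.target

    # probability=True is required for method='predict_proba'.
    clf = SVC(class_weight='balanced', probability=True, random_state=0)
    proba = cross_val_predict(clf, X, y, cv=10, method='predict_proba')

    # Columns follow the sorted class labels (0, 1, 2 for iris). argsort
    # ranks ascending; reverse it to rank by "most likely class 1" first.
    ranking = np.argsort(proba[:, 1])[::-1]
    print(proba[:5, 1])
    print(ranking[:10])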

How to do cross-validation in SparkR

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-08 10:47:02
Question: I am working with the MovieLens dataset. I have a matrix (m x n) with user ids as rows and movie ids as columns, and I have applied dimension-reduction and matrix-factorization techniques to reduce my sparse matrix (to m x k, where k < n). I want to evaluate the performance using the k-nearest neighbor algorithm (not a library, my own code). I am using SparkR 1.6.2. I don't know how to split my dataset into training data and test data in SparkR. I have tried native R functions (sample, subset, caret) but it is
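Base-R helpers operate on local data frames, not on distributed Spark DataFrames, which is the usual reason they fail here. Since the other examples on this page are in Python, here is the analogous split sketched with PySpark's randomSplit (Spark 2.x API; the file path is a placeholder, and the SparkR equivalent on 1.6.2 would differ):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cv-split").getOrCreate()

    # "ratings.csv" is a placeholder for the MovieLens ratings file.
    df = spark.read.csv("ratings.csv", header=True, inferSchema=True)

    # randomSplit partitions the distributed DataFrame without collecting it
    # to the driver, unlike base-R sample()/subset().
    train, test = df.randomSplit([0.8, 0.2], seed=42)
    print(train.count(), test.count())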