cross-validation

H2O - balance classes - cross validation

岁酱吖の submitted on 2019-12-10 11:34:18
Question: I would like to build a GBM model with H2O. My data set is imbalanced, so I am using the balance_classes parameter. For grid search (parameter tuning) I would like to use 5-fold cross-validation. I am wondering how H2O deals with class balancing in that case. Will only the training folds be rebalanced? I want to be sure the test fold is not rebalanced. Thank you.

Answer 1: In class imbalance settings, artificially balancing the test/validation set does not make any sense: these sets must remain
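A minimal sketch of the setup being asked about, using H2O's Python API; the file path and the "label" column name are placeholders, and the comment reflects the behavior the question is hoping for (rebalancing applied to the data the fold models are trained on, not to the holdout fold):

    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    h2o.init()
    # "train.csv" and the "label" column are hypothetical placeholders.
    train = h2o.import_file("train.csv")
    train["label"] = train["label"].asfactor()
    features = [c for c in train.columns if c != "label"]

    # balance_classes rebalances the data each model is trained on;
    # cross-validation holdout predictions are made on the original rows.
    gbm = H2OGradientBoostingEstimator(balance_classes=True, nfolds=5, seed=42)
    gbm.train(x=features, y="label", training_frame=train)
    print(gbm.cross_validation_metrics_summary())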

Scikit F-score metric error

做~自己de王妃 submitted on 2019-12-10 03:54:15
Question: I am trying to predict a set of labels using Logistic Regression from SciKit. My data is really imbalanced (there are many more '0' than '1' labels), so I have to use the F1 score metric during the cross-validation step to "balance" the result. [Input]

    X_training, y_training, X_test, y_test = generate_datasets(df_X, df_y, 0.6)
    logistic = LogisticRegressionCV(Cs=50, cv=4, penalty='l2', fit_intercept=True, scoring='f1')
    logistic.fit(X_training, y_training)
    print('Predicted: %s' % str(logistic
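For reference, a self-contained, runnable version of the same pattern; the synthetic imbalanced data and the train_test_split call stand in for the question's generate_datasets helper:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Synthetic imbalanced data (roughly 90% '0' labels).
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=0)

    # scoring='f1' makes the internal 4-fold CV pick the C that maximizes F1,
    # which is more informative than accuracy on imbalanced labels.
    logistic = LogisticRegressionCV(Cs=50, cv=4, penalty='l2', fit_intercept=True, scoring='f1')
    logistic.fit(X_tr, y_tr)
    print('Test F1: %.3f' % f1_score(y_te, logistic.predict(X_te)))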

Sklearn StratifiedKFold: ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-09 14:34:57
Question: I am working with Sklearn's stratified k-fold split, and when I attempt to split using multi-class labels, I receive an error (see below). When I split using binary labels, it works with no problem.

    num_classes = len(np.unique(y_train))
    y_train_categorical = keras.utils.to_categorical(y_train, num_classes)
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)

    # splitting data into different folds
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical)):
        x_train_kf, x_val
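The error arises because kf.split is given the one-hot matrix, which sklearn classifies as a 'multilabel-indicator' target. A sketch of the usual fix (x_train and y_train here are synthetic placeholders): stratify on the integer labels and one-hot encode inside the loop:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from tensorflow import keras  # assumption: keras is used via tensorflow

    # Placeholder data: features plus *integer* class labels (3 classes).
    x_train = np.random.rand(100, 8)
    y_train = np.random.randint(0, 3, size=100)

    num_classes = len(np.unique(y_train))
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)

    # Pass the integer labels to split(); StratifiedKFold rejects one-hot
    # targets. One-hot encode per fold instead.
    for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):
        x_tr, x_val = x_train[train_index], x_train[val_index]
        y_tr = keras.utils.to_categorical(y_train[train_index], num_classes)
        y_val = keras.utils.to_categorical(y_train[val_index], num_classes)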

Is cv.glmnet overfitting the data by using the full lambda sequence?

江枫思渺然 submitted on 2019-12-09 13:51:30
Question: cv.glmnet is used in most research papers and companies. While building a similar function to cv.glmnet for glmnet.cr (a package that implements the lasso for continuation-ratio ordinal regression), I came across this problem in cv.glmnet. cv.glmnet first fits the model:

    glmnet.object = glmnet(x, y, weights = weights, offset = offset, lambda = lambda, ...)

After the glmnet object is created with the complete data, the next step goes as follows: The lambda from the complete
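To make the two-step design concrete, here is a hedged Python sketch of the same pattern (the lasso lambda_max formula is standard up to centering; the grid spacing and dataset are illustrative). The point is that the complete data contributes only the grid of candidate penalties, while every scored model is refit from scratch on K-1 folds:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)

    # Step 1 (done on the *complete* data, as in cv.glmnet): derive only the
    # lambda grid. For the lasso, lambda_max = max|X'y| / n (up to centering)
    # is the smallest penalty that zeroes out all coefficients.
    lam_max = np.max(np.abs(X.T @ y)) / len(y)
    lambdas = np.logspace(np.log10(lam_max), np.log10(lam_max * 1e-3), 50)

    # Step 2: each candidate penalty is evaluated by refitting on K-1 folds,
    # so the full-data fit contributes nothing but the grid itself.
    cv_mse = np.zeros(len(lambdas))
    for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        for j, lam in enumerate(lambdas):
            model = Lasso(alpha=lam, max_iter=10000).fit(X[train], y[train])
            cv_mse[j] += np.mean((y[test] - model.predict(X[test])) ** 2) / 10
    print("best lambda:", lambdas[np.argmin(cv_mse)])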

Scikit - Combining scale and grid search

走远了吗. submitted on 2019-12-09 13:34:42
Question: I am new to scikit and have two small issues combining data scaling and grid search. Efficient scaler: considering cross-validation with K folds, I would like that each time we train the model on the K-1 folds, the data scaler (using preprocessing.StandardScaler() for instance) is fit only on the K-1 folds and then applied to the remaining fold. My impression is that the following code will fit the scaler on the entire dataset, and therefore I would like to modify it to behave as described
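The standard remedy, sketched below with an arbitrary dataset and estimator: wrap the scaler and the model in a Pipeline, so that GridSearchCV refits the scaler inside every CV split rather than on the whole dataset:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    # Inside a Pipeline, StandardScaler is fit on the K-1 training folds only
    # and merely applied (transform, not fit) to the held-out fold.
    pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
    grid = GridSearchCV(pipe, param_grid={'svc__C': [0.1, 1, 10]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)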

Best parameters found by Hyperopt are unsuitable

北城余情 submitted on 2019-12-09 11:43:02
Question: I used hyperopt to search for the best parameters for an SVM classifier, but Hyperopt says the best 'kernel' is '0'. {'kernel': '0'} is obviously unsuitable. Does anyone know whether it's caused by my fault or a bug in hyperopt? Code is below.

    from hyperopt import fmin, tpe, hp, rand
    import numpy as np
    from sklearn.metrics import accuracy_score
    from sklearn import svm
    from sklearn.cross_validation import StratifiedKFold

    parameter_space_svc = {
        'C': hp.loguniform("C", np.log(1), np.log(100)),
        'kernel': hp
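This is not a bug: hp.choice reports the index of the selected option, not the option itself. A minimal sketch of the usual fix using hyperopt's space_eval (the dummy objective below stands in for the question's cross-validation loss):

    import numpy as np
    from hyperopt import fmin, hp, space_eval, tpe

    space = {
        'C': hp.loguniform('C', np.log(1), np.log(100)),
        'kernel': hp.choice('kernel', ['linear', 'rbf']),
    }

    # fmin returns hp.choice values as indices into the option list.
    best = fmin(fn=lambda params: 0.0,  # dummy objective (placeholder for CV loss)
                space=space, algo=tpe.suggest, max_evals=10)
    print(best)                     # e.g. {'C': 3.2, 'kernel': 0}
    print(space_eval(space, best))  # e.g. {'C': 3.2, 'kernel': 'linear'}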

cost function in cv.glm of boot library in R

这一生的挚爱 submitted on 2019-12-09 04:55:28
Question: I am trying to use the cross-validation function cv.glm from the boot library in R to determine the number of misclassifications when a GLM logistic regression is applied. The function has the following signature:

    cv.glm(data, glmfit, cost, K)

with the first two denoting the data and model, and K specifying the number of folds. My problem is the cost parameter, which is defined as: cost: A function of two vector arguments specifying the cost function for the cross-validation. The first argument to cost
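cv.glm is an R function, but the cost it expects is easy to state in Python (the language of the other examples on this page): the first argument is the vector of observed responses, the second the cross-validated fitted values. For a logistic model, the commonly used misclassification cost thresholds the absolute difference at 0.5; a hedged sketch:

    import numpy as np

    # Python rendering of the misclassification cost typically passed to
    # cv.glm for a logistic model: observed 0/1 responses vs. predicted
    # probabilities, returning the error rate under a 0.5 threshold.
    def misclassification_cost(y_observed, prob_predicted):
        y = np.asarray(y_observed)
        p = np.asarray(prob_predicted)
        return np.mean(np.abs(y - p) > 0.5)

    print(misclassification_cost([0, 1, 1, 0], [0.2, 0.9, 0.4, 0.6]))  # 0.5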

NaNs suddenly appearing for sklearn KFolds

妖精的绣舞 submitted on 2019-12-08 19:40:35
I'm trying to run cross-validation on my data set. The data appears to be clean, but when I try to run it, some of my data gets replaced by NaNs. I'm not sure why. Has anybody seen this before?

    y, X = np.ravel(df_test['labels']), df_test[['variation', 'length', 'tempo']]
    X_train, X_test, y_train, y_test = cv.train_test_split(X, y, test_size=.30, random_state=4444)

This is what my X data looked like before KFolds:

       variation       length       tempo
    0   0.005144  1183.148118  135.999178
    1   0.002595   720.165442  117.453835
    2   0.008146   397.500952  112.347147
    3   0.005367  1109.819501  172.265625
    4   0.001631   509.931973
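A common cause, offered here as a hedged guess since the full traceback isn't shown: train_test_split shuffles the DataFrame index, and label-based lookup on the shuffled frame with the positional indices that KFold yields can silently produce NaN rows. A sketch of the safe .iloc pattern on toy data:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold, train_test_split

    # Toy frame standing in for df_test; values are not the question's data.
    df = pd.DataFrame({'variation': np.random.rand(10),
                       'length': np.random.rand(10),
                       'tempo': np.random.rand(10),
                       'labels': np.random.randint(0, 2, 10)})
    X, y = df[['variation', 'length', 'tempo']], df['labels'].values

    # After this, X_train keeps a scrambled DataFrame index.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=4444)

    # KFold yields *positional* indices, so index with .iloc, never .loc,
    # to avoid missing-label lookups turning into NaN rows.
    for train_idx, val_idx in KFold(n_splits=5).split(X_train):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]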

How to rank the instances based on prediction probability in sklearn

与世无争的帅哥 submitted on 2019-12-08 14:00:40
I am using sklearn's support vector machine (SVC) as follows to get the prediction probability of the instances in my dataset, using 10-fold cross-validation.

    from sklearn import datasets
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    clf = SVC(class_weight="balanced")
    proba = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
    print(clf.classes_)
    print(proba[:,1])
    print(np.argsort(proba[:,1]))

My expected output is as follows for print(proba[:,1]) and print(np.argsort(proba[:,1])), where the first one indicates the prediction probability of all instances for class
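Two details keep the quoted snippet from running as-is: SVC needs probability=True before predict_proba is available, and clf.classes_ is only set after fitting (cross_val_predict fits clones, not clf itself). A hedged, runnable sketch of the ranking:

    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import cross_val_predict
    from sklearn.svm import SVC

    iris = datasets.load_iris()
    X, y = iris.data, iris.target

    # probability=True is required for method='predict_proba'.
    clf = SVC(class_weight='balanced', probability=True, random_state=0)
    proba = cross_val_predict(clf, X, y, cv=10, method='predict_proba')

    # Columns follow the sorted class labels (0, 1, 2 for iris). argsort
    # ranks ascending; reverse it to rank by "most likely class 1" first.
    ranking = np.argsort(proba[:, 1])[::-1]
    print(proba[:5, 1])
    print(ranking[:10])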

How to do cross-validation in SparkR

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-08 10:47:02
Question: I am working with the MovieLens dataset. I have a matrix (m x n) with user ids as rows and movie ids as columns, and I have applied dimension-reduction and matrix-factorization techniques to reduce my sparse matrix (to m x k, where k < n). I want to evaluate the performance using the k-nearest neighbor algorithm (not a library, my own code). I am using SparkR 1.6.2. I don't know how to split my dataset into training data and test data in SparkR. I have tried native R functions (sample, subset, caret) but it is
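Base-R helpers operate on local data frames, not on distributed Spark DataFrames, which is the usual reason they fail here. Since the other examples on this page are in Python, here is the analogous split sketched with PySpark's randomSplit (Spark 2.x API; the file path is a placeholder, and the SparkR equivalent on 1.6.2 would differ):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cv-split").getOrCreate()

    # "ratings.csv" is a placeholder for the MovieLens ratings file.
    df = spark.read.csv("ratings.csv", header=True, inferSchema=True)

    # randomSplit partitions the distributed DataFrame without collecting it
    # to the driver, unlike base-R sample()/subset().
    train, test = df.randomSplit([0.8, 0.2], seed=42)
    print(train.count(), test.count())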