cross-validation

Cross validation for glm() models

浪子不回头ぞ submitted on 2019-12-02 17:19:05
I'm trying to run 10-fold cross-validation on some glm models that I built earlier in R. I'm a little confused about the cv.glm() function in the boot package, although I've read a lot of help files. When I make the following call:
    library(boot)
    cv.glm(data, glmfit, K=10)
does the "data" argument refer to the whole dataset or only to the test set? The examples I have seen so far pass the test set as the "data" argument, but that did not really make sense to me: why run 10 folds on the same test set? They would all give exactly the same result (I assume!).
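As far as the boot documentation describes it, `data` is the data frame that the model in `glmfit` was fitted on, and cv.glm() does the K-way split itself. The sketch below illustrates that idea in plain Python rather than R: the full set of row indices is split into K disjoint test folds, and each fold is held out once. The helper name `k_fold_indices` and the sample count are assumptions made for illustration only.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k disjoint test folds (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


n_samples = 25  # assumed size of the whole dataset
folds = k_fold_indices(n_samples, k=10)

for i, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [j for j in range(n_samples) if j not in held_out]
    # A model would be refitted on train_idx and scored on test_idx here.
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)}")
```

Every row lands in exactly one test fold, which is why the splitter needs the whole dataset and why the ten folds do not give identical results.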

How to perform random forest/cross validation in R

本秂侑毒 submitted on 2019-12-02 16:48:39
I'm unable to find a way of performing cross-validation on a regression random forest model that I'm trying to produce. I have a dataset containing 1664 explanatory variables (different chemical properties) and one response variable (retention time). I'm trying to produce a regression random forest model in order to be able to predict the retention time of something given its chemical properties.
    ID    RT (seconds)  1_MW    2_AMW  3_Sv   4_Se
    4281  38            145.29  5.01   14.76  28.37
    4952  40            132.19  6.29   11     21.28
    4823  41            176.21  7.34   12.9   24.92
    3840  41            174.24  6.7    13.99  26.48
    3665  42            240.34  9.24   15.2   27.08
    3591

k nearest neighbors with cross validation for accuracy score and confusion matrix

你。 submitted on 2019-12-02 14:04:54
Question: I have the following data, where each column is one sample: the rows of numbers are the inputs and the letter in the first row is the output.
    A,A,A,B,B,B
    -0.979090189,0.338819904,-0.253746508,0.213454999,-0.580601104,-0.441683968
    -0.48395313,0.436456904,-1.427424032,-0.107093825,0.320813402,0.060866105
    -1.098818173,-0.999161692,-1.371721698,-1.057324962,-1.161752652,-0.854872591
    -1.53191442,-1.465454248,-1.350414216,-1.732518018,-1.674040715,-1.561568496
    2.522796162,2.498153298,3.11756171,2.125738509,3.003929536,2
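One possible way to approach this, sketched in plain Python with no libraries: transpose the data so that each column becomes one sample, then run leave-one-out cross-validation (K-fold with K equal to the number of samples) with a nearest-neighbour vote. Only the four complete rows shown in the question are used. The helper `knn_predict` and the choice of k=1 are assumptions, not part of the original post.

```python
import math
from collections import Counter

labels = ["A", "A", "A", "B", "B", "B"]
rows = [
    [-0.979090189, 0.338819904, -0.253746508, 0.213454999, -0.580601104, -0.441683968],
    [-0.48395313, 0.436456904, -1.427424032, -0.107093825, 0.320813402, 0.060866105],
    [-1.098818173, -0.999161692, -1.371721698, -1.057324962, -1.161752652, -0.854872591],
    [-1.53191442, -1.465454248, -1.350414216, -1.732518018, -1.674040715, -1.561568496],
]
# Each column is one sample, so transpose rows into per-sample feature lists.
samples = [list(col) for col in zip(*rows)]


def knn_predict(train_X, train_y, x, k=1):
    """Majority vote among the k nearest training samples (hypothetical helper)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


classes = sorted(set(labels))
# confusion[true_label][predicted_label] counts held-out predictions.
confusion = {t: {p: 0 for p in classes} for t in classes}

for i, (x, y_true) in enumerate(zip(samples, labels)):
    # Hold out sample i and train on the remaining samples.
    train_X = samples[:i] + samples[i + 1:]
    train_y = labels[:i] + labels[i + 1:]
    confusion[y_true][knn_predict(train_X, train_y, x)] += 1

accuracy = sum(confusion[c][c] for c in classes) / len(labels)
print(confusion)
print(accuracy)
```

With only six samples, leave-one-out is about the only split that leaves enough training data per fold; with more data an ordinary K-fold split would work the same way.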

Scikit-learn TypeError: If no scoring is specified, the estimator passed should have a 'score' method

烈酒焚心 submitted on 2019-12-02 12:57:42
I have created a custom model in Python using scikit-learn, and I want to use cross-validation. The class for the model is defined as follows:
    class MultiLabelEnsemble:
        ''' MultiLabelEnsemble(predictorInstance, balance=False)
        Like OneVsRestClassifier: Wrapping class to train multiple models when several objectives are given as target values. Its predictor may be an ensemble.
        This class can be used to create a one-vs-rest classifier from multiple 0/1 labels to treat a multi-label problem or to create a one-vs-rest classifier from a categorical target variable.
        Arguments:
            predictorInstance -- A
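The error in the title says that, when no `scoring` is given, the estimator must provide a `score` method. A minimal sketch of one way to satisfy that, assuming the custom class can inherit from scikit-learn's `BaseEstimator` and `ClassifierMixin` (the mixin supplies an accuracy-based `score`). `MajorityClassifier` and the toy data are hypothetical stand-ins, not the `MultiLabelEnsemble` class from the question.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import cross_val_score


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in: always predicts the most frequent training label."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.classes_ = values
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)


# Toy data, invented for illustration only.
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 14 + [1] * 6)

# No `scoring` argument: cross_val_score falls back to the inherited score().
scores = cross_val_score(MajorityClassifier(), X, y, cv=5)
print(scores)
```

Passing an explicit `scoring=` argument (for example `scoring="accuracy"`) is the other route the error message points to.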

How to display confusion matrix and report (recall, precision, fmeasure) for each cross validation fold

泪湿孤枕 submitted on 2019-12-02 12:13:17
Question: I am trying to perform 10-fold cross-validation in Python. I know how to calculate the confusion matrix and the report for a single train/test split (for example, 80% training and 20% testing). The problem is that I don't know how to calculate the confusion matrix and the report for each fold, for example with 10 folds; I only know the code for the average accuracy.
Answer 1: Here is a reproducible example with the breast cancer data and 3-fold CV for simplicity:
    from sklearn.datasets import load_breast_cancer
    from sklearn
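The answer above is cut off, so the following is not its code. It is a minimal sketch of the general idea, using the same dataset and 3 folds: loop over the splits manually and compute the confusion matrix and classification report inside the loop. The choice of `DecisionTreeClassifier` and the variable names are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

fold_matrices = []
for fold, (train_idx, test_idx) in enumerate(k_fold.split(X, y), start=1):
    # Fit a fresh model on this fold's training part (classifier is an assumption).
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])

    # Per-fold confusion matrix and precision/recall/F1 report.
    cm = confusion_matrix(y[test_idx], y_pred)
    fold_matrices.append(cm)
    print(f"Fold {fold}")
    print(cm)
    print(classification_report(y[test_idx], y_pred))
```

Summing the matrices in `fold_matrices` gives a single confusion matrix over all held-out predictions, if an overall view is also wanted.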

How to perform cross-validation in keras functional api in python

风格不统一 submitted on 2019-12-02 10:26:52
Question: I want to perform cross-validation on a Keras model with multiple inputs, so I tried KerasClassifier. This works fine with a normal sequential model with only one input. However, when using the functional API and extending to two inputs, sklearn's cross_val_predict does not seem to work as expected.
    def create_model():
        input_text = Input(shape=(1,), dtype=tf.string)
        embedding = Lambda(UniversalEmbedding, output_shape=(512, ))(input_text)
        dense = Dense(256, activation='relu')(embedding)
        input

ValueError: Cannot have number of splits n_splits=3 greater than the number of samples: 1

喜你入骨 submitted on 2019-12-02 08:00:35
I am trying to train a model using train_test_split and a decision tree regressor:
    import sklearn
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score
    # TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
    new_data = samples.drop('Fresh', axis=1)
    # TODO: Split the data into training and testing sets using the given feature as the target
    X_train, X_test, y_train, y_test = train_test_split(new_data, samples['Fresh'], test_size=0.25, random_state=0)
    # TODO: Create
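The question is truncated, so the actual cause is not visible here. In general, this message appears when the array handed to a K-fold splitter has fewer rows than `n_splits`. The sketch below reproduces that condition with a single-row array and then shows the same split succeeding on six rows. The arrays are made-up placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold

# A single sample (hypothetical data): too few rows for 3 folds.
X_one = np.array([[1.0, 2.0]])

try:
    # split() is lazy, so list() is needed to trigger the check.
    list(KFold(n_splits=3).split(X_one))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With at least as many rows as folds, the split works.
X_ok = np.arange(12).reshape(6, 2)
splits = list(KFold(n_splits=3).split(X_ok))
print(len(splits))
```

Printing the shape of whatever is passed to `cross_val_score` is a quick way to check whether the data really has the number of rows expected.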

Why do I get different values with pipeline and without pipeline in sklearn in Python

空扰寡人 submitted on 2019-12-02 04:53:58
Question: I am using recursive feature elimination with cross-validation (RFECV) together with GridSearchCV and a RandomForest classifier, once with a pipeline and once without. My code with the pipeline is as follows.
    X = df[my_features_all]
    y = df['gold_standard']
    #get development and testing sets
    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
    from sklearn.pipeline import Pipeline
    #cross validation setting
    k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv.glm variable lengths differ

半世苍凉 submitted on 2019-12-02 02:46:25
Question: I am trying to run cv.glm on a linear model; however, each time I do, I get the error
    Error in model.frame.default(formula = lindata$Y ~ 0 + lindata$HomeAdv + : variable lengths differ (found for 'air-force-falcons')
air-force-falcons is the first variable in the dataset lindata. When I run glm I get no errors. All the variables are in a single dataset and there are no missing values.
    > linearmod5<- glm(lindata$Y ~ 0 + lindata$HomeAdv + ., data=lindata, na.action="na.exclude")
    > set.seed(1)
    > cv.err

How to calculate feature importance in each model of cross validation in sklearn

喜欢而已 submitted on 2019-12-02 02:30:20
I am using RandomForestClassifier() with 10-fold cross-validation as follows.
    clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
    k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring = 'accuracy')
    print(accuracy.mean())
I want to identify the important features in my feature space. It seems to be straightforward to get the feature importance for a single classification as follows.
    print("Features sorted by their score:")
    feature_importances = pd.DataFrame(clf.feature_importances_, index = X_train.columns,
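One way to get importances from every fold, sketched under assumptions: `cross_validate` accepts `return_estimator=True` and then returns the fitted model of each fold, whose `feature_importances_` can be collected into a table. The synthetic data from `make_classification`, the column names `f0` to `f5`, and `n_estimators=50` are placeholders, not values from the question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the asker's X and y (assumption).
X_arr, y = make_classification(n_samples=200, n_features=6, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

clf = RandomForestClassifier(n_estimators=50, random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# return_estimator=True keeps the model fitted on each fold's training part.
results = cross_validate(
    clf, X, y, cv=k_fold, scoring="accuracy", return_estimator=True
)

# One row per fold, one column per feature.
importances = pd.DataFrame(
    [est.feature_importances_ for est in results["estimator"]],
    columns=X.columns,
)
print(results["test_score"].mean())
print(importances.mean().sort_values(ascending=False))
```

Averaging over folds gives a single ranking, while the per-fold rows show how stable each feature's importance is across splits.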