I have a matrix with 20 columns. The last column contains 0/1 labels.
The link to the data is here.
I am trying to run a random forest on the dataset using cross-validation. I use two methods of doing this:
- using sklearn.cross_validation.cross_val_score
- using sklearn.cross_validation.train_test_split
I am getting different results when I do what I think is essentially the same thing. To illustrate, I run two-fold cross-validation using the two methods above, as in the code below.
import csv
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

# read in the data
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:, 0:18]
y = data.iloc[:, 19]

depth = 5
maxFeat = 3

# method 1: two-fold cross-validation with cross_val_score
result = cross_val_score(ensemble.RandomForestClassifier(n_estimators=1000,
                                                         max_depth=depth,
                                                         max_features=maxFeat,
                                                         oob_score=False),
                         X, y, scoring='roc_auc', cv=2)
result
# result is now something like array([ 0.66773295,  0.58824739])

# method 2: a manual 50/50 train/test split, scored in both directions
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)
RFModel = ensemble.RandomForestClassifier(n_estimators=1000,
                                          max_depth=depth,
                                          max_features=maxFeat,
                                          oob_score=False)
RFModel.fit(xtrain, ytrain)
prediction = RFModel.predict_proba(xtest)
auc = roc_auc_score(ytest, prediction[:, 1:2])
print auc    # something like 0.83

RFModel.fit(xtest, ytest)
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:, 1:2])
print auc    # also something like 0.83
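As a sanity check, I was also thinking of reproducing the folds by hand so that the exact same splits get scored both ways. Here is a rough sketch of what I mean (as far as I understand, cross_val_score uses StratifiedKFold for classifiers by default, but I have not verified that this loop matches its internals exactly):

from sklearn.cross_validation import StratifiedKFold

# Build the same kind of 2-fold split that cross_val_score should use,
# then fit and score each fold manually with roc_auc_score.
folds = StratifiedKFold(y, n_folds=2)
for train_idx, test_idx in folds:
    model = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                            max_features=maxFeat, oob_score=False)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    proba = model.predict_proba(X.iloc[test_idx])[:, 1]
    print roc_auc_score(y.iloc[test_idx], proba)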
My question is: why am I getting different results, i.e., why is the AUC (the metric I am using) higher when I use train_test_split?
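To quantify what I mean by "different results" on the train_test_split side, here is a quick sketch I was planning to run, repeating the split with different seeds to see how much the AUC moves between random 50/50 splits (the loop and the explicit random_state values are my own addition, not part of the script above):

# Repeat the 50/50 split with different seeds to see how much the AUC
# varies between random splits of the same data.
for seed in range(5):
    xtr, xte, ytr, yte = train_test_split(X, y, test_size=0.50, random_state=seed)
    model = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                            max_features=maxFeat, oob_score=False)
    model.fit(xtr, ytr)
    print roc_auc_score(yte, model.predict_proba(xte)[:, 1])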
Note: When I use more folds (say 10 folds), there appears to be some kind of pattern in my results, with the first fold always giving me the highest AUC.
In the case of the two-fold cross validation in the example above, the first AUC is always higher than the second one; it's always something like 0.70 and 0.58.
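Since I don't shuffle the data anywhere before calling cross_val_score, I wondered whether this pattern just reflects the original row ordering of the file. This is a sketch of the check I had in mind, shuffling the rows once before cross-validating (the permutation/reset_index step is my own addition, and I haven't run it yet):

# Shuffle the rows once, then re-run cross_val_score, to test whether the
# fold-order pattern comes from the original ordering of the rows.
idx = np.random.permutation(len(data))
data_shuffled = data.iloc[idx].reset_index(drop=True)
Xs = data_shuffled.iloc[:, 0:18]
ys = data_shuffled.iloc[:, 19]
result_shuffled = cross_val_score(ensemble.RandomForestClassifier(n_estimators=1000,
                                                                  max_depth=depth,
                                                                  max_features=maxFeat,
                                                                  oob_score=False),
                                  Xs, ys, scoring='roc_auc', cv=10)
print result_shuffled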
Thanks for your help!