I have a matrix with 20 columns. The last column contains 0/1 labels.
The link to the data is here.
I am trying to run a random forest on the dataset using cross-validation. I use two methods of doing this:
- using sklearn.cross_validation.cross_val_score
- using sklearn.cross_validation.train_test_split
I am getting different results when I do what I think is essentially the same thing. To illustrate, I run two-fold cross-validation using the two methods above, as in the code below.
import csv
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

# read in the data
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:, 0:18]
y = data.iloc[:, 19]

depth = 5
maxFeat = 3

# method 1: two-fold cross-validation with cross_val_score
result = cross_val_score(ensemble.RandomForestClassifier(n_estimators=1000,
                                                         max_depth=depth,
                                                         max_features=maxFeat,
                                                         oob_score=False),
                         X, y, scoring='roc_auc', cv=2)
result
# result is now something like array([ 0.66773295,  0.58824739])

# method 2: a manual 50/50 train/test split, scored in both directions
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)
RFModel = ensemble.RandomForestClassifier(n_estimators=1000,
                                          max_depth=depth,
                                          max_features=maxFeat,
                                          oob_score=False)
RFModel.fit(xtrain, ytrain)
prediction = RFModel.predict_proba(xtest)
auc = roc_auc_score(ytest, prediction[:, 1:2])
print auc    # something like 0.83

RFModel.fit(xtest, ytest)
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:, 1:2])
print auc    # also something like 0.83
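As a sanity check, I was also thinking of reproducing the folds by hand so that the exact same splits get scored both ways. Here is a rough sketch of what I mean (as far as I understand, cross_val_score uses StratifiedKFold for classifiers by default, but I have not verified that this loop matches its internals exactly):

from sklearn.cross_validation import StratifiedKFold

# Build the same kind of 2-fold split that cross_val_score should use,
# then fit and score each fold manually with roc_auc_score.
folds = StratifiedKFold(y, n_folds=2)
for train_idx, test_idx in folds:
    model = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                            max_features=maxFeat, oob_score=False)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    proba = model.predict_proba(X.iloc[test_idx])[:, 1]
    print roc_auc_score(y.iloc[test_idx], proba)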
My question is: why am I getting different results, i.e., why is the AUC (the metric I am using) higher when I use train_test_split?
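To quantify what I mean by "different results" on the train_test_split side, here is a quick sketch I was planning to run, repeating the split with different seeds to see how much the AUC moves between random 50/50 splits (the loop and the explicit random_state values are my own addition, not part of the script above):

# Repeat the 50/50 split with different seeds to see how much the AUC
# varies between random splits of the same data.
for seed in range(5):
    xtr, xte, ytr, yte = train_test_split(X, y, test_size=0.50, random_state=seed)
    model = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                            max_features=maxFeat, oob_score=False)
    model.fit(xtr, ytr)
    print roc_auc_score(yte, model.predict_proba(xte)[:, 1])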
Note: When I use more folds (say 10 folds), there appears to be some kind of pattern in my results, with the first fold always giving me the highest AUC.
In the case of the two-fold cross validation in the example above, the first AUC is always higher than the second one; it's always something like 0.70 and 0.58.
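Since I don't shuffle the data anywhere before calling cross_val_score, I wondered whether this pattern just reflects the original row ordering of the file. This is a sketch of the check I had in mind, shuffling the rows once before cross-validating (the permutation/reset_index step is my own addition, and I haven't run it yet):

# Shuffle the rows once, then re-run cross_val_score, to test whether the
# fold-order pattern comes from the original ordering of the rows.
idx = np.random.permutation(len(data))
data_shuffled = data.iloc[idx].reset_index(drop=True)
Xs = data_shuffled.iloc[:, 0:18]
ys = data_shuffled.iloc[:, 19]
result_shuffled = cross_val_score(ensemble.RandomForestClassifier(n_estimators=1000,
                                                                  max_depth=depth,
                                                                  max_features=maxFeat,
                                                                  oob_score=False),
                                  Xs, ys, scoring='roc_auc', cv=10)
print result_shuffled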
Thanks for your help!