Evaluating Logistic regression with cross validation

Submitted anonymously (unverified) on 2019-12-03 02:50:02

Question:

I would like to use cross-validation to train/test my dataset and evaluate the performance of the logistic regression model on the entire dataset, not only on a single held-out test set (e.g. 25%).

These concepts are totally new to me and I am not very sure whether I am doing it right. I would be grateful if anyone could point out where I have gone wrong and advise me on the right steps to take. Part of my code is shown below.

Also, how can I plot the ROC curves for "y2" and "y3" on the same graph as the current one?

Thank you

import pandas as pd

Data = pd.read_csv('C:\\Dataset.csv', index_col='SNo')
feature_cols = ['A', 'B', 'C', 'D', 'E']
X = Data[feature_cols]
y = Data['Status']
Y1 = Data['Status1']  # predictions from elsewhere
Y2 = Data['Status2']  # predictions from elsewhere

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn import metrics, cross_validation
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
metrics.accuracy_score(y, predicted)

from sklearn.cross_validation import cross_val_score
accuracy = cross_val_score(logreg, X, y, cv=10, scoring='accuracy')
print(accuracy)
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())

from nltk import ConfusionMatrix
print(ConfusionMatrix(list(y), list(predicted)))
#print(ConfusionMatrix(list(y), list(yexpert)))

# sensitivity:
print(metrics.recall_score(y, predicted))

import matplotlib.pyplot as plt
probs = logreg.predict_proba(X)[:, 1]
plt.hist(probs)
plt.show()

# use 0.5 cutoff for predicting 'default'
import numpy as np
preds = np.where(probs > 0.5, 1, 0)
print(ConfusionMatrix(list(y), list(preds)))

# check accuracy, sensitivity, specificity
print(metrics.accuracy_score(y, predicted))

# ROC CURVES and AUC
# plot ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y, probs)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

# calculate AUC
print(metrics.roc_auc_score(y, probs))

# use AUC as evaluation metric for cross-validation
from sklearn.cross_validation import cross_val_score
logreg = LogisticRegression()
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()

Answer 1:

You almost got it right. cross_validation.cross_val_predict gives you predictions for the entire dataset; you just need to remove the logreg.fit call earlier in the code (as written it even runs before X_train is defined), because cross_val_predict fits the model for you. Specifically, it does the following: it divides your dataset into n folds, and in each iteration it leaves one fold out as the test set and trains the model on the remaining n-1 folds. So in the end you get an out-of-fold prediction for every sample in the data.
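To make the fold mechanics concrete, here is a minimal sketch of the equivalent manual loop, using KFold from sklearn.model_selection (the newer home of the cross-validation helpers) and a synthetic dataset as a stand-in for your X and y. For classification targets cross_val_predict actually uses stratified folds by default, so treat this as an illustration of the idea rather than the library's exact internals:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn import metrics

# toy data standing in for your X (5 features) and binary y ('Status')
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

predicted = np.empty_like(y)
for train_idx, test_idx in KFold(n_splits=10).split(X):    # 10 folds, as in cv=10
    fold_model = LogisticRegression()
    fold_model.fit(X[train_idx], y[train_idx])              # train on the other 9 folds
    predicted[test_idx] = fold_model.predict(X[test_idx])   # predict the held-out fold

# every sample gets exactly one prediction, made by a model that never saw it
print(metrics.accuracy_score(y, predicted))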

Let's illustrate this with one of the built-in datasets in sklearn, iris. This dataset contains 150 samples with 4 features each; iris['data'] is X and iris['target'] is y:

In [15]: iris['data'].shape
Out[15]: (150, 4)

To get predictions on the entire set with cross validation you can do the following:

from sklearn.linear_model import LogisticRegression
from sklearn import metrics, cross_validation
from sklearn import datasets

iris = datasets.load_iris()
predicted = cross_validation.cross_val_predict(LogisticRegression(), iris['data'], iris['target'], cv=10)

print(metrics.accuracy_score(iris['target'], predicted))

Out [1]: 0.9537

print(metrics.classification_report(iris['target'], predicted))

Out [2]:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        50
          1       0.96      0.90      0.93        50
          2       0.91      0.96      0.93        50

avg / total       0.95      0.95      0.95       150

So, back to your code. All you need is this:

from sklearn.linear_model import LogisticRegression
from sklearn import metrics, cross_validation

logreg = LogisticRegression()
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
print(metrics.accuracy_score(y, predicted))
print(metrics.classification_report(y, predicted))
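One thing to note: probs = logreg.predict_proba(X)[:, 1] in your original code comes from a single fitted model scoring samples it may have been trained on, so the ROC/AUC built from it is not cross-validated. If you want the curve built from out-of-fold probabilities instead, cross_val_predict can return them via method='predict_proba'; that argument only exists in the newer sklearn.model_selection version of the function, so this sketch assumes a reasonably recent scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
import matplotlib.pyplot as plt

# X and y as defined in your code
logreg = LogisticRegression()

# out-of-fold probability of the positive class for every sample
probs = cross_val_predict(logreg, X, y, cv=10, method='predict_proba')[:, 1]

fpr, tpr, thresholds = metrics.roc_curve(y, probs)
plt.plot(fpr, tpr, label='logreg, cross-validated (AUC = %.3f)' % metrics.roc_auc_score(y, probs))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()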

For plotting ROC curves in the genuinely multi-class case, you can follow the ROC example in the scikit-learn documentation, which overlays one curve per class on a single plot.
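The "same graph" part of your question works the same way for any number of predictors: compute fpr/tpr separately for each set of scores and draw all the curves on one set of axes before calling plt.show(). A minimal sketch, assuming probs is the cross-validated probability vector from above and that Status1/Status2 (your Y1 and Y2) hold scores or probabilities; roc_curve also accepts hard 0/1 predictions, but then each "curve" collapses to a single point:

from sklearn import metrics
import matplotlib.pyplot as plt

curves = {'logreg (CV)': probs, 'Status1': Y1, 'Status2': Y2}

for name, scores in curves.items():
    fpr, tpr, _ = metrics.roc_curve(y, scores)
    plt.plot(fpr, tpr, label='%s (AUC = %.3f)' % (name, metrics.roc_auc_score(y, scores)))

plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()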

In general, sklearn has very good tutorials and documentation. I strongly recommend reading their tutorial on cross_validation.


