Question
I'm using scikit-learn to build a supervised classifier, and I am currently tuning it to give good accuracy on the labeled data. But how do I estimate how well it will do on the test data (which is unlabeled)?
Also, how do I find out whether I'm starting to overfit the classifier?
Answer 1:
You can't score your method on unlabeled data because you need to know the right answers. To evaluate a method, you should split your training set into a (new) train set and a test set (via sklearn.model_selection.train_test_split, for example; in older scikit-learn releases this lived in sklearn.cross_validation). Then fit the model on the train set and score it on the test set. If you don't have a lot of data and holding some of it out would noticeably hurt the algorithm's performance, use cross-validation instead.
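A minimal sketch of both approaches, assuming a generic classifier (logistic regression on the built-in iris dataset here, purely for illustration) and the modern sklearn.model_selection module:

```python
# Hold-out evaluation and cross-validation sketch; the dataset and the
# classifier are placeholders, not from the original question.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Hold out 25% of the labeled data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# With little data, 5-fold cross-validation uses every sample for both
# training and validation, averaging the score across folds.
scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

The hold-out score is quick but depends on one random split; the cross-validated mean is a more stable estimate when data is scarce.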
Since overfitting is the inability to generalize, a low test score (especially compared with a high training score) is a good indicator of it.
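One way to see this in practice is to compare train and test accuracy as model capacity grows. A hypothetical sketch using a decision tree, where max_depth controls capacity (my example, not from the original answer):

```python
# Overfitting check: a widening gap between train and test accuracy as
# capacity grows is the classic symptom.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

for depth in (1, 3, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    print(f"max_depth={depth}: train={train_acc:.3f} test={test_acc:.3f} "
          f"gap={train_acc - test_acc:+.3f}")
```

A fully grown tree will typically reach perfect training accuracy; if the test accuracy does not keep up, the gap tells you the extra capacity is being spent memorizing the training data.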
For more theory and some other approaches, take a look at this article.
Source: https://stackoverflow.com/questions/24315765/how-do-you-estimate-the-performance-of-a-classifier-on-test-data