Question
I'm using scikit-learn to build a supervised classifier, and I am currently tuning it to give good accuracy on the labeled data. But how do I estimate how well it will do on the test data (which is unlabeled)?
Also, how do I find out whether I'm starting to overfit the classifier?
Answer 1:
You can't score your method on unlabeled data because you need to know the right answers. To evaluate a method, you should split your training set into a (new) train set and a test set (via sklearn.model_selection.train_test_split, for example; in older scikit-learn releases this lived in sklearn.cross_validation). Then fit the model on the train set and score it on the test set. If you don't have a lot of data and holding some of it out would noticeably hurt the algorithm's performance, use cross-validation instead.
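A minimal sketch of both approaches, assuming a generic classifier (logistic regression on the built-in iris dataset here, purely for illustration) and the modern sklearn.model_selection module:

```python
# Hold-out evaluation and cross-validation sketch; the dataset and the
# classifier are placeholders, not from the original question.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Hold out 25% of the labeled data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# With little data, 5-fold cross-validation uses every sample for both
# training and validation, averaging the score across folds.
scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

The hold-out score is quick but depends on one random split; the cross-validated mean is a more stable estimate when data is scarce.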
Since overfitting is the inability to generalize, a low test score (especially compared with a high training score) is a good indicator of it.
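One way to see this in practice is to compare train and test accuracy as model capacity grows. A hypothetical sketch using a decision tree, where max_depth controls capacity (my example, not from the original answer):

```python
# Overfitting check: a widening gap between train and test accuracy as
# capacity grows is the classic symptom.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

for depth in (1, 3, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    print(f"max_depth={depth}: train={train_acc:.3f} test={test_acc:.3f} "
          f"gap={train_acc - test_acc:+.3f}")
```

A fully grown tree will typically reach perfect training accuracy; if the test accuracy does not keep up, the gap tells you the extra capacity is being spent memorizing the training data.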
For more theory and some other approaches, take a look at this article.
Source: https://stackoverflow.com/questions/24315765/how-do-you-estimate-the-performance-of-a-classifier-on-test-data