Question
I'm trying to train a random forest on an accelerometer dataset. I compute features like the mean, standard deviation, correlation between axes, area under the curve, and others. I'm an ML noob.
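Roughly, the per-window feature extraction looks something like this (just a sketch; the column names and window length are placeholders, not my exact code):

```python
import pandas as pd

def window_features(window: pd.DataFrame) -> dict:
    """Summary features for one window of accelerometer samples
    (placeholder column names 'x', 'y', 'z')."""
    feats = {}
    for axis in ('x', 'y', 'z'):
        feats[f'{axis}_mean'] = window[axis].mean()
        feats[f'{axis}_sd'] = window[axis].std()
        # crude area-under-curve proxy: sum of absolute sample values
        feats[f'{axis}_area'] = window[axis].abs().sum()
    # pairwise correlations between the axes
    feats['corr_xy'] = window['x'].corr(window['y'])
    feats['corr_xz'] = window['x'].corr(window['z'])
    feats['corr_yz'] = window['y'].corr(window['z'])
    return feats

def extract_features(signal: pd.DataFrame, window_size: int = 128) -> pd.DataFrame:
    """Slide a fixed-size, non-overlapping window over the raw signal."""
    rows = [window_features(signal.iloc[start:start + window_size])
            for start in range(0, len(signal) - window_size + 1, window_size)]
    return pd.DataFrame(rows)
```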
I'm trying to understand two things:
1. If I split the dataset from one person into train and test sets and run the RF prediction, the accuracy is high (> 90%). However, if I train the RF on data from different people and then predict, the accuracy is low (< 50%). Why? How do I debug this? I'm not sure what I'm doing wrong.
2. In the above example, to get to 90% accuracy, how many features are "enough"? How much data is "enough"?
I can furnish more details. The dataset is from 10 people, with large files of labelled data. I have limited myself to the above features to keep the compute manageable.
Answer 1:
Most probably your classifier overfits: when you train it on data from only one person, it does not generalize well. It may simply "memorize" the dataset together with its labels instead of capturing general rules of the distribution: how each feature correlates with the others, how they affect the result, etc. Maybe you need more data, or more features.
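One practical way to check this (a sketch using scikit-learn; `X`, `y` and the per-row person ids in `groups` below are placeholders for your own data) is to compare a random split with a leave-one-person-out split:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score

# Placeholder data: replace with your real feature matrix, labels and person ids
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))           # n_windows x n_features
y = rng.integers(0, 4, size=1000)         # activity label per window
groups = rng.integers(0, 10, size=1000)   # which of the 10 people produced the window

rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Random split: windows from the same person appear in both train and test,
# so the forest can exploit person-specific signal characteristics.
random_scores = cross_val_score(rf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Leave-one-person-out: the test person is never seen during training,
# which matches how the model would actually be used on a new person.
loso_scores = cross_val_score(rf, X, y, groups=groups, cv=LeaveOneGroupOut())

print("random split accuracy:         ", random_scores.mean())
print("leave-one-subject-out accuracy:", loso_scores.mean())
```

If the random-split score is much higher than the leave-one-subject-out score, the forest is exploiting person-specific patterns rather than learning the activity itself.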
This is not an easy question; it is the problem of generalization, and there is a lot of theoretical research on it, for example Vapnik–Chervonenkis theory and the Akaike information criterion. Even with knowledge of such theories you cannot answer this question precisely. The main principle behind most of them is: the more data you have, the less flexible the model you try to fit, and the smaller the gap you require between training and test accuracy, the higher these theories will rank your model. E.g. if you want to minimize the difference between accuracy on the test and training sets (to make sure that accuracy on the test data will not collapse), you need to increase the amount of data, provide more meaningful features (with respect to your model), or fit a less flexible model. If you are interested in a more detailed explanation of the theoretical aspects, you can watch the lectures from Caltech, starting with CaltechX - CS1156x Learning from Data.
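To see the data-versus-complexity trade-off concretely, you can look at a learning curve (again a sketch, reusing the same placeholder `X`, `y`, `groups`): if the training score stays near 100% while the cross-person validation score plateaus well below it, the model is overfitting and you need more subjects, more informative features, or a more constrained model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, learning_curve

# Same placeholders as above: swap in your real features, labels and person ids
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = rng.integers(0, 4, size=1000)
groups = rng.integers(0, 10, size=1000)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y, groups=groups,
    cv=GroupKFold(n_splits=5),              # keep each person on one side of the split
    train_sizes=np.linspace(0.2, 1.0, 5),
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:4d} samples  train={tr:.2f}  cross-person validation={va:.2f}")
```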
Source: https://stackoverflow.com/questions/31550337/ml-enough-features