ML enough features?

旧城冷巷雨未停 submitted on 2019-12-11 22:02:55

Question


I'm trying to train a random forest on an accelerometer dataset. I calculate features like mean, standard deviation, correlation between axes, area under the curve, and others. I'm an ML noob.

I'm trying to understand two things:

  1. If I split the dataset from one person into test and train sets and run the RF prediction, the accuracy is high (> 90%). However, if I train the RF with data from different people and then predict, the accuracy is low (< 50%). Why? How do I debug this? I'm not sure what I'm doing wrong.

  2. In the above example, to get to 90% accuracy, how many features are "enough"? How much data is "enough"?

I can furnish more details. Dataset is from 10 people, large files of labelled data. I have limited myself to the above features to avoid lots of compute.


Answer 1:


  1. Most probably your classifier overfits. When you train it on only one person, it does not generalize well: it may simply "memorize" that person's labelled data instead of capturing the general rules of the distribution (how the features correlate with each other, how they affect the result, and so on). Maybe you need more data, or more features.

  2. This is not an easy question. It is a generalization problem, and there is a lot of theoretical research on it, for example Vapnik–Chervonenkis theory and the Akaike information criterion. Even with knowledge of such theories you cannot answer this question precisely. The main principle behind most of them: the more data you have, the less flexible the model you are trying to fit, and the less tightly you require training and test accuracy to match, the more favourably these theories rank your model. For example, if you want to minimize the difference between accuracy on the training set and on the test set (to make sure that accuracy on test data does not collapse), you need to increase the amount of data, provide more meaningful features (with respect to your model), or fit a less flexible model. If you are interested in a more detailed explanation of the theoretical side, you can watch the Caltech lectures, starting from CaltechX - CS1156x Learning from data.
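One way to debug the gap described in the question is to score the model per person: hold out all data from one subject, train on the rest, and compare that with a random split inside a single subject. The sketch below is only an illustration under loud assumptions. The data is synthetic, the per-person offset and the "walk"/"run" labels are made up, and a 1-nearest-neighbour classifier stands in for the random forest so the example runs with the standard library alone. With scikit-learn, `LeaveOneGroupOut` or `GroupKFold` from `sklearn.model_selection` serve the same purpose.

```python
import random


def make_windows(n_subjects=5, per_class=40, seed=0):
    """Build synthetic rows of (subject_id, features, label). All values are made up."""
    rng = random.Random(seed)
    rows = []
    for subject in range(n_subjects):
        # Assumed per-person bias (e.g. sensor placement), larger than the activity signal.
        offset = 3.0 * subject
        for label, shift in (("walk", -1.0), ("run", 1.0)):
            for _ in range(per_class):
                features = (offset + shift + rng.gauss(0, 0.2), rng.gauss(0, 0.2))
                rows.append((subject, features, label))
    return rows


def predict_1nn(train, point):
    """Return the label of the closest training row (a model that purely memorizes)."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[1], point))
    return min(train, key=sq_dist)[2]


def accuracy(train, test):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(predict_1nn(train, feats) == label for _, feats, label in test)
    return hits / len(test)


def single_subject_accuracy(rows, subject, seed=0):
    """Random 70/30 split inside one person's data."""
    own = [r for r in rows if r[0] == subject]
    random.Random(seed).shuffle(own)
    cut = int(0.7 * len(own))
    return accuracy(own[:cut], own[cut:])


def leave_one_subject_out_accuracy(rows):
    """Hold out each person in turn, train on everyone else, average the scores."""
    subjects = sorted({r[0] for r in rows})
    scores = []
    for held_out in subjects:
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        scores.append(accuracy(train, test))
    return sum(scores) / len(scores)


rows = make_windows()
print("random split, one person:", single_subject_accuracy(rows, subject=0))
print("leave-one-subject-out:   ", leave_one_subject_out_accuracy(rows))
```

In this toy setup the random split inside one person scores far higher than the leave-one-subject-out average, because the memorized feature positions belong to that person and do not transfer to someone with a different offset. If real data shows the same pattern, the per-person score is the more honest estimate of how the model behaves on a new user.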



Source: https://stackoverflow.com/questions/31550337/ml-enough-features
