Correlated features and classification accuracy

Submitted by 喜夏-厌秋 on 2019-12-03 01:21:43

Question


I'd like to ask how correlated features (variables) affect the classification accuracy of machine learning algorithms. By correlated features I mean features that are correlated with each other, not with the target class (e.g., the perimeter and the area of a geometric figure, or the level of education and the average income). In my opinion, correlated features negatively affect the accuracy of a classification algorithm, I'd say because the correlation makes one of them useless. Is this really the case? Does the answer depend on the type of classification algorithm? Any suggestions for papers and lectures are very welcome! Thanks


Answer 1:


Correlated features do not affect classification accuracy per se. The problem in realistic situations is that we have a finite number of training examples with which to train a classifier. For a fixed number of training examples, increasing the number of features typically increases classification accuracy up to a point, but as the number of features continues to increase, classification accuracy will eventually decrease because we are then undersampled relative to the large number of features. To learn more about the implications of this, look at the curse of dimensionality.
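
To make the undersampling effect concrete, here is a small toy simulation. It is only a sketch under assumptions that are not in the answer above: the data are synthetic Gaussians, only feature 0 carries class information, every extra feature is pure noise, and the classifier is a hand-written nearest-centroid rule (the helper name `nearest_centroid_accuracy` is made up for this example). Because the added features are uninformative, it illustrates only the declining part of the curve.

```python
import numpy as np

def nearest_centroid_accuracy(n_features, n_train_per_class=10,
                              n_test_per_class=1000, seed=0):
    """Test accuracy of a nearest-centroid rule when only feature 0 is informative."""
    rng = np.random.default_rng(seed)

    def sample(n, label):
        x = rng.normal(size=(n, n_features))
        x[:, 0] += 2.0 * label  # class 1 is shifted by 2 along feature 0 only
        return x

    train0, train1 = sample(n_train_per_class, 0), sample(n_train_per_class, 1)
    test0, test1 = sample(n_test_per_class, 0), sample(n_test_per_class, 1)

    # Class centroids estimated from the (small) training set
    c0, c1 = train0.mean(axis=0), train1.mean(axis=0)

    def predict(x):
        d0 = np.linalg.norm(x - c0, axis=1)
        d1 = np.linalg.norm(x - c1, axis=1)
        return (d1 < d0).astype(int)

    correct = (predict(test0) == 0).sum() + (predict(test1) == 1).sum()
    return correct / (2 * n_test_per_class)

# Same training-set size each time, growing number of features
for d in (1, 10, 100, 1000):
    print(d, nearest_centroid_accuracy(d))
```

With the training set fixed at 10 examples per class, the noise accumulated in the centroid estimates over many uninformative dimensions eventually swamps the single informative one.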

If two numerical features are perfectly correlated, then one doesn't add any additional information (it is determined by the other). So if the number of features is too high (relative to the training sample size), then it is beneficial to reduce the number of features through a feature extraction technique (e.g., principal component analysis).
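
As a rough illustration of the redundancy, the sketch below builds two perfectly linearly correlated features and runs a principal component analysis by hand using NumPy's SVD. The data and variable names are invented for the example; the point is only that essentially all of the variance falls on the first component, so the second one can be dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, size=200)   # hypothetical first feature
y = 3.0 * x + 1.0                     # second feature, fully determined by the first
X = np.column_stack([x, y])

# PCA via SVD of the centered data: squared singular values are
# proportional to the variance along each principal component
Xc = X - X.mean(axis=0)
_, s, vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)
print(explained)  # the first entry is essentially 1, the second essentially 0

# Keep only the first principal component: 2 features reduced to 1
X_reduced = Xc @ vt[:1].T
print(X_reduced.shape)
```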

The effect of correlation does depend on the type of classifier. Some nonparametric classifiers are less sensitive to correlation of variables (although training time will likely increase with an increase in the number of features). For statistical methods such as Gaussian maximum likelihood, having too many correlated features relative to the training sample size will render the classifier unusable in the original feature space (the covariance matrix of the sample data becomes singular).
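
The singular covariance matrix can be checked numerically. The following is a minimal sketch with made-up random data: in the first case there are fewer training samples than features, and in the second there are plenty of samples but one feature is an exact multiple of another. In both cases the sample covariance matrix is rank-deficient, so it cannot be inverted as a Gaussian maximum likelihood classifier requires.

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: fewer samples than features
n_samples, n_features = 5, 8
X_small = rng.normal(size=(n_samples, n_features))
cov_small = np.cov(X_small, rowvar=False)      # 8 x 8 sample covariance
rank_small = np.linalg.matrix_rank(cov_small)
print(cov_small.shape, rank_small)             # rank is at most n_samples - 1

# Case 2: many samples, but one feature is perfectly correlated with another
a = rng.normal(size=100)
b = rng.normal(size=100)
X_dup = np.column_stack([a, b, 2.0 * a])       # third column duplicates the first
cov_dup = np.cov(X_dup, rowvar=False)          # 3 x 3 sample covariance
rank_dup = np.linalg.matrix_rank(cov_dup)
print(cov_dup.shape, rank_dup)                 # rank is less than 3
```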




Answer 2:


In general, I'd say the more uncorrelated the features are, the better the classifier performance is going to be. Given a set of highly correlated features, it may be possible to use PCA techniques to make them as orthogonal as possible to improve classifier performance.
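
For reference, here is a minimal sketch of what PCA does to a pair of correlated features, using synthetic data and a hand-rolled eigendecomposition rather than any particular library's PCA class. It only demonstrates that the rotated features are uncorrelated; whether that helps accuracy depends on the classifier and is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=500)   # strongly correlated with x1
X = np.column_stack([x1, x2])

# Rotate the centered data onto the eigenvectors of its covariance matrix
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ eigvecs

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(Z, rowvar=False)[0, 1]
print(corr_before)   # close to 1: the original features are highly correlated
print(corr_after)    # numerically zero: the principal components are uncorrelated
```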



Source: https://stackoverflow.com/questions/14813884/correlated-features-and-classification-accuracy
