What is the difference between Principal Component Analysis (PCA) and Feature Selection in Machine Learning? Is PCA a means of feature selection?
Just to add to the answer by @Roger Rowland. In the context of supervised learning (classification, regression) I like to think of PCA as a "feature transformer" rather than a feature selector.
PCA is based on extracting the axes along which the data shows the highest variability. Although it "spreads out" the data in the new basis, and can be of great help in unsupervised learning, there is no guarantee that the new axes are consistent with the discriminatory features in a supervised problem.
Put more simply, there is no guarantee at all that your top principal components are the most informative when it comes to predicting the dependent variable (e.g. class label).
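This can be made concrete with a toy example. The sketch below (assuming scikit-learn and NumPy are available; the data is synthetic and constructed just for illustration) builds a dataset where one feature has high variance but no relation to the class label, and another has low variance but is strongly discriminative. PCA's first component latches onto the high-variance, uninformative feature, while a univariate selector such as `SelectKBest` picks the informative one:

```python
# Minimal sketch: PCA (variance-driven transform) vs. univariate
# feature selection (label-driven), using synthetic data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)           # binary class label

# Feature 0: high variance, unrelated to y.
noise = rng.normal(scale=10.0, size=n)
# Feature 1: low variance, tightly aligned with y.
signal = y + rng.normal(scale=0.1, size=n)
X = np.column_stack([noise, signal])

# PCA's first component is dominated by the uninformative
# high-variance feature (index 0).
pca = PCA(n_components=1).fit(X)
print("PC1 loadings:", pca.components_[0])

# Univariate selection scores each original feature against y
# and keeps the discriminative one (index 1).
selector = SelectKBest(f_classif, k=1).fit(X, y)
print("selected feature index:", selector.get_support(indices=True)[0])
```

Here PCA would "throw away" most of the signal if you kept only the first component, whereas feature selection keeps the original, interpretable feature that actually predicts the label.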
This paper is a useful source. Another relevant Cross Validated thread is here.