How can I use PCA/SVD in Python for feature selection AND identification?

Question

I'm following "Principal component analysis in Python" to run PCA in Python, but I'm struggling to determine which features to choose (i.e. which of my columns/features have the highest variance).

When I use scipy.linalg.svd, it automatically sorts the singular values, so I can't tell which column each one belongs to.

Example code:

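import numpy as np
from scipy.linalg import svd

# Each inner list becomes one column (feature) after the transpose below.
M = [
     [1, 1, 1, 1, 1, 1],
     [3, 3, 3, 3, 3, 3],
     [2, 2, 2, 2, 2, 2],
     [9, 9, 9, 9, 9, 9]
]
M = np.transpose(np.array(M))  # shape (6, 4): 6 rows, 4 columns

# s holds the singular values, always returned in descending order.
U, s, Vt = svd(M, full_matrices=False)
print(s)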

Is there a different way to go about this without the Singular Values being sorted?

Update: It looks like this might not be possible, at least according to this post on the Matlab forums: http://www.mathworks.com/matlabcentral/newsreader/view_thread/241607. If anyone knows otherwise, let me know :)

Answer 1:

I was under the wrong impression that PCA does feature selection; in fact, it does feature extraction.

PCA creates a new set of features, each of which is a linear combination of the input features.
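To make the "combination of the input features" point concrete, here is a minimal sketch using only NumPy. The toy data, the random seed, and the variable names (X, Xc, scores, explained_variance) are assumptions made for illustration and are not part of the original question. It shows that the sorted singular values describe the new components, not the original columns.

```python
import numpy as np

# Hypothetical toy data: 6 samples (rows) x 4 input features (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4)) * np.array([1.0, 3.0, 2.0, 9.0])

# Center each column; PCA is defined on mean-centered data.
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each extracted feature (a column of scores) is a weighted sum of ALL input
# columns, with the weights given by the matching row of Vt.
scores = Xc @ Vt.T

# Variance carried by each component (not by each input column).
explained_variance = s**2 / (X.shape[0] - 1)

print(Vt[0])               # weights of the 4 input features on component 0
print(explained_variance)  # one value per component, in descending order
```

Because every row of Vt generally has non-zero entries for all input columns, a singular value cannot be attributed to a single column, which is why the sort order does not map back to the inputs.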

If you really want to use PCA for feature selection, you can look at the weights of the input features on the PCA-created features. For instance, the matplotlib.mlab.PCA class (available in older Matplotlib releases) exposes these weights in its Wt attribute:

from matplotlib.mlab import PCA
res = PCA(data)  # data: 2-D array, one row per observation
print("weights of input vectors: %s" % res.Wt)
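The same weights can be obtained without matplotlib.mlab. The following is a rough sketch, assuming NumPy only; the helper name rank_features_by_loading and the toy data are hypothetical and introduced here purely for illustration. It orders the input columns by the absolute size of their weight on one principal component, which is a heuristic rather than an established feature selection method.

```python
import numpy as np

def rank_features_by_loading(X, component=0):
    """Order the columns of X by |weight| on one principal component.

    Hypothetical helper for illustration; returns (order, weights).
    """
    Xc = X - X.mean(axis=0)                       # center each column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    weights = Vt[component]                       # one weight per input column
    order = np.argsort(-np.abs(weights))          # largest |weight| first
    return order, weights

# Hypothetical toy data: 4 independent columns with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([1.0, 3.0, 2.0, 9.0])

order, weights = rank_features_by_loading(X)
print("columns ranked by weight on component 0:", order)
print("weights:", weights)
```

Note that the weights depend on the scale of each column: a column measured in larger units will dominate the first component, so standardizing the columns first may be appropriate when they are not in comparable units.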

That said, feature extraction seems to be the intended way to use PCA.

Source: https://stackoverflow.com/questions/14205941/how-can-i-use-pca-svd-in-python-for-feature-selection-and-identification

Tags

python

scipy

pca