I'd like to use principal component analysis (PCA) for dimensionality reduction. Does numpy or scipy already have it, or do I have to roll my own using numpy.linalg.eigh?
matplotlib.mlab has a PCA implementation.
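For reference, a minimal sketch of how that class is used. Treat it as a sketch for older installations: matplotlib.mlab.PCA was later deprecated and removed from matplotlib, and the attribute names below are from my memory of its docs.
import numpy as np
from matplotlib.mlab import PCA  # only present in older matplotlib releases

data = np.random.randn(100, 5)   # rows are observations, columns are variables
results = PCA(data)              # centers (and by default scales) the data itself
print(results.fracs)             # fraction of variance explained by each component
print(results.Y)                 # data projected onto the principal components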
If you're working with 3D vectors, you can apply SVD concisely using the toolbelt vg. It's a light layer on top of numpy.
import numpy as np
import vg
vg.principal_components(data)
There's also a convenient alias if you only want the first principal component:
vg.major_axis(data)
I created the library at my last startup, where it was motivated by uses like this: simple ideas which are verbose or opaque in NumPy.
You might have a look at MDP.
I have not had the chance to test it myself, but I've bookmarked it exactly for the PCA functionality.
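For what it's worth, MDP's documented front-page usage is about this short; I haven't tested it either, so consider it a sketch of the documented API rather than verified code:
import mdp
import numpy as np

x = np.random.randn(200, 10)  # observations in rows, variables in columns
y = mdp.pca(x)                # shortcut function: project x onto its principal components

# the node API lets you keep a fixed number of dimensions:
pcanode = mdp.nodes.PCANode(output_dim=3)
pcanode.train(x)
y3 = pcanode.execute(x)       # training is finalized when you call execute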
You do not need full Singular Value Decomposition (SVD), since it computes all eigenvalues and eigenvectors, which can be prohibitively expensive for large matrices. scipy and its sparse module provide generic linear algebra functions that work on both sparse and dense matrices, among them the eig* family of functions:
http://docs.scipy.org/doc/scipy/reference/sparse.linalg.html#matrix-factorizations
Scikit-learn provides a Python PCA implementation, which only supports dense matrices for now.
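A minimal example with scikit-learn's estimator API (the class and methods below are its documented interface):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(100, 5)        # dense input: rows are samples
pca = PCA(n_components=2)          # keep the first two principal components
X_reduced = pca.fit_transform(X)   # center, fit, and project in one step
print(pca.explained_variance_ratio_)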
Timings (note that eigsh assumes a symmetric input and computes only a few eigenpairs, k=6 by default, which is why it is so much faster than the full SVD here):
In [1]: import numpy as np
In [2]: import scipy.sparse.linalg
In [3]: A = np.random.randn(1000, 1000)
In [4]: %timeit scipy.sparse.linalg.eigsh(A)
1 loops, best of 3: 802 ms per loop
In [5]: %timeit np.linalg.svd(A)
1 loops, best of 3: 5.91 s per loop
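To actually use this for PCA, apply eigsh to the symmetric covariance matrix and request only the few largest eigenpairs; a rough sketch:
import numpy as np
import scipy.sparse.linalg

X = np.random.randn(1000, 50)           # observations in rows
Xc = X - X.mean(axis=0)                 # center each variable
cov = Xc.T.dot(Xc) / (X.shape[0] - 1)   # 50x50 symmetric covariance matrix

evals, evecs = scipy.sparse.linalg.eigsh(cov, k=5, which='LM')  # 5 largest eigenpairs
X_reduced = Xc.dot(evecs)               # project onto the top 5 components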
You can quite easily "roll" your own using scipy.linalg (assuming a pre-centered dataset data):
import scipy.linalg
covmat = data.dot(data.T)               # scatter matrix, proportional to the covariance (variables in rows)
evs, evmat = scipy.linalg.eigh(covmat)  # eigh rather than eig: covmat is symmetric
Then evs are your eigenvalues and evmat is your projection matrix. If you want to keep d dimensions, use the d largest eigenvalues and their corresponding eigenvectors; since eigh returns the eigenvalues in ascending order, these are the last d columns of evmat.
Given that scipy.linalg has the decomposition and numpy the matrix multiplications, what else do you need?
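Putting it together, a self-contained sketch of the above (assuming variables in rows and observations in columns, to match the covmat line):
import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)
data = rng.standard_normal((5, 200))      # 5 variables, 200 observations
data -= data.mean(axis=1, keepdims=True)  # center each variable

covmat = data.dot(data.T)                 # scatter matrix, proportional to the covariance
evs, evmat = scipy.linalg.eigh(covmat)    # eigenvalues in ascending order

d = 2
proj = evmat[:, -d:][:, ::-1]             # d eigenvectors with the largest eigenvalues
reduced = proj.T.dot(data)                # d x 200 reduced representation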