Guru,
When choosing the number of principal components k, we pick the smallest k such that, for example, 99% of the variance is retained.
However, in Python's scikit-learn, I am not 100% sure that cutting off at pca.explained_variance_ratio_ = 0.99 is the same as "99% of the variance is retained". Could anyone enlighten me? Thanks.
- The scikit-learn PCA manual is here
Yes, you are nearly right. The pca.explained_variance_ratio_ attribute returns a vector of the fraction of variance explained by each dimension, so pca.explained_variance_ratio_[i] gives the variance explained solely by the (i+1)-st dimension.
You probably want pca.explained_variance_ratio_.cumsum(). That returns a vector x such that x[i] is the cumulative fraction of variance explained by the first i+1 dimensions.
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
my_matrix = np.random.randn(20, 5)  # 20 samples, 5 features

my_model = PCA(n_components=5)
my_model.fit(my_matrix)

print(my_model.explained_variance_)                  # variance of each component
print(my_model.explained_variance_ratio_)            # fraction of total variance per component
print(my_model.explained_variance_ratio_.cumsum())   # cumulative fraction of variance
This prints:
[ 1.50756565  1.29374452  0.97042041  0.61712667  0.31529082]
[ 0.32047581  0.27502207  0.20629036  0.13118776  0.067024  ]
[ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]
So in my random toy data, if I picked k=4, I would retain 93.3% of the variance.
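If you want to pick that k programmatically, here is a minimal sketch reusing my_model from above; np.argmax on a boolean array returns the index of the first True entry, so this finds the smallest k whose cumulative explained variance reaches the threshold:

import numpy as np

threshold = 0.99
cumulative = my_model.explained_variance_ratio_.cumsum()
k = int(np.argmax(cumulative >= threshold)) + 1  # first index meeting the threshold, as a component count
print(k, cumulative[k - 1])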
Although this question is more than two years old, I want to provide an update. I wanted to do the same thing, and it looks like sklearn now provides this feature out of the box.
As stated in the docs:
if 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components
So the required code is now:
my_model = PCA(n_components=0.99, svd_solver='full')  # keep enough components to explain > 99% of the variance
my_model.fit_transform(my_matrix)
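As a follow-up, the fitted estimator stores how many components were actually selected in its n_components_ attribute, so you can inspect the chosen k directly:

print(my_model.n_components_)  # number of components needed to exceed 99% of the variance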
Source: https://stackoverflow.com/questions/32857029/python-scikit-learn-pca-explained-variance-ratio-cutoff