Python scikit learn pca.explained_variance_ratio_ cutoff

甜味超标 2020-12-23 21:05

When choosing the number of principal components (k), we choose k to be the smallest value such that a target fraction of the variance, for example 99%, is retained.

However, in the Pytho

3 Answers
  • 2020-12-23 21:36

    This worked for me with even less typing in the PCA section. The rest is added for convenience. Only 'data' needs to be defined in an earlier stage.

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    
    # Standardize the features, then keep enough components for 80% of the variance
    st = StandardScaler().fit_transform(data)
    pca = PCA(0.80)
    pc = pca.fit_transform(st)  # projected data, kept in an object for later use
    
    print("Components = ", pca.n_components_, ";\nTotal explained variance = ",
          round(pca.explained_variance_ratio_.sum(), 5))
    
  • 2020-12-23 21:45

    Yes, you are nearly right. The pca.explained_variance_ratio_ attribute is a vector holding the fraction of variance explained by each component. Thus pca.explained_variance_ratio_[i] gives the fraction of variance explained solely by the (i+1)-th component.

    You probably want pca.explained_variance_ratio_.cumsum(). That returns a vector x such that x[i] is the cumulative fraction of variance explained by the first i+1 components.

    import numpy as np
    from sklearn.decomposition import PCA
    
    np.random.seed(0)
    my_matrix = np.random.randn(20, 5)
    
    my_model = PCA(n_components=5)
    my_model.fit(my_matrix)
    
    print(my_model.explained_variance_)
    print(my_model.explained_variance_ratio_)
    print(my_model.explained_variance_ratio_.cumsum())
    

    [ 1.50756565  1.29374452  0.97042041  0.61712667  0.31529082]
    [ 0.32047581  0.27502207  0.20629036  0.13118776  0.067024  ]
    [ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]
    

    So in my random toy data, if I picked k=4 I would retain 93.3% of the variance.
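The cumulative sum can also be used to pick k automatically rather than reading it off the printed vector. The following is a minimal sketch, assuming the same kind of random toy data as above; the names `X`, `threshold`, and `k` are illustrative, and the 0.90 threshold is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(20, 5)  # hypothetical toy data, 20 samples x 5 features

# Fit with all components so the cumulative ratios cover the full variance.
pca = PCA(n_components=5).fit(X)
cumulative = pca.explained_variance_ratio_.cumsum()

threshold = 0.90  # illustrative target fraction of variance to retain

# searchsorted returns the first index whose cumulative ratio is >= threshold;
# adding 1 converts that zero-based index into a component count.
k = int(np.searchsorted(cumulative, threshold) + 1)

# Guard against floating-point rounding leaving the last entry just below 1.0.
k = min(k, len(cumulative))

print("k =", k, "retains", cumulative[k - 1])
```

A second PCA(n_components=k) can then be fitted, or the first k columns of the already transformed data can be kept.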

  • 2020-12-23 21:47

    Although this question is more than two years old, I want to provide an update. I wanted to do the same thing, and it looks like sklearn now provides this feature out of the box.

    As stated in the docs:

    if 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components

    So the code required is now:

    my_model = PCA(n_components=0.99, svd_solver='full')
    my_model.fit_transform(my_matrix)
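
To see what the fractional n_components does in practice, the fitted model exposes the chosen count as n_components_. The sketch below, assuming random toy data and an arbitrary 0.90 target (the names `X`, `target`, `auto`, `full`, and `manual_k` are illustrative), compares that count with the one obtained by hand from the cumulative sum.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(20, 5)  # hypothetical toy data

target = 0.90  # illustrative fraction of variance to retain

# Let scikit-learn choose the number of components for the target fraction.
auto = PCA(n_components=target, svd_solver="full").fit(X)

# Manual choice: first count whose cumulative ratio reaches the target.
full = PCA(svd_solver="full").fit(X)
cumulative = full.explained_variance_ratio_.cumsum()
manual_k = int(np.searchsorted(cumulative, target) + 1)

print("chosen by PCA:", auto.n_components_, "chosen by hand:", manual_k)
print("variance retained:", auto.explained_variance_ratio_.sum())
```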
    