Why does Sklearn PCA need more samples than new features (n_components)?

Submitted by 淺唱寂寞╮ on 2019-12-10 21:07:59

Question


When using the Sklearn PCA algorithm like this:

import numpy as np
from sklearn.decomposition import PCA

x_orig = np.random.choice([0, 1], (4, 25), replace=True)
pca = PCA(n_components=15)
pca.fit_transform(x_orig).shape

I get the output:

(4, 4)

I expected (and want) it to be:

(4, 15)

I get why it's happening. The sklearn documentation (here) says (assuming their '==' is an assignment operator):

n_components == min(n_samples, n_features)

But why do they do this? Also, how can I convert an input with shape [1, 25] to [1, 10] directly (without stacking dummy arrays)?


Answer 1:


Each principal component is the projection of the data onto an eigenvector of the data covariance matrix. If you have fewer samples n than features, the covariance matrix has at most n non-zero eigenvalues (n − 1 after mean-centering, since centering removes one degree of freedom). Thus, there are only that many eigenvectors/components that make sense.
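This rank limit is easy to check numerically. The sketch below (variable names are illustrative, not from the original post) builds the 25×25 covariance matrix of 4 samples and counts its non-negligible eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([0, 1], size=(4, 25))   # 4 samples, 25 features

# The feature covariance matrix is 25x25, but its rank is bounded by
# the number of samples; np.cov mean-centers, which costs one more.
cov = np.cov(x, rowvar=False)          # shape (25, 25)
eigvals = np.linalg.eigvalsh(cov)

n_nonzero = int(np.sum(eigvals > 1e-10))
print(cov.shape)    # (25, 25)
print(n_nonzero)    # at most 3 = n_samples - 1
```

Only those few directions carry any variance, so PCA cannot produce more informative components than that.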

In principle it could be possible to have more components than samples, but the superfluous components would be useless noise.

Scikit-learn raises an error instead of silently doing anything. This prevents users from shooting themselves in the foot. Having fewer samples than features can indicate a problem with the data, or a misconception about the methods involved.



Source: https://stackoverflow.com/questions/51040075/why-sklearn-pca-needs-more-samples-than-new-featuresn-components
