I know that principal component analysis does an SVD on a matrix and then generates an eigenvalue matrix. To select the principal components we have to take only the first few.
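Your premise is essentially right: the component variances that pca returns (its latent output) are the eigenvalues of the data's covariance matrix, which equal the squared singular values of the mean-centered data divided by n-1. Here is a minimal sketch to verify that relationship; the variable names are mine, not part of the original answer:

% Side check: PCA eigenvalues == squared singular values of the
% mean-centered data, divided by (number of samples - 1).
nObs = 50;
X = rand(nObs, 8);                         % arbitrary example data
Xc = X - repmat(mean(X, 1), nObs, 1);      % center each column, as pca does
[~, ~, latent] = pca(X);                   % latent = PC variances (eigenvalues)
s = svd(Xc);                               % singular values of centered data
disp(max(abs(latent - s.^2/(nObs - 1))))   % effectively zero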
Depending on your situation, it may be useful to define the maximum allowed relative error that you incur by projecting your data onto ndim dimensions.
I will illustrate this with a small MATLAB example. Just skip the code if you are not interested in it.
I will first generate a random matrix of n samples (rows) and p features (columns) containing exactly 100 non-zero principal components.
n = 200;                                  % number of samples (rows)
p = 119;                                  % number of features (columns)
data = zeros(n, p);
for i = 1:100                             % sum of 100 random rank-one matrices
    data = data + rand(n, 1)*rand(1, p);  % => rank(data) is (almost surely) 100
end
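As a quick sanity check (not part of the original construction), you can ask MATLAB for the rank of the result; a sum of 100 random rank-one matrices almost surely has rank exactly 100:

fprintf('rank(data) = %d\n', rank(data));   % should print 100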
Displayed as an image, the generated data will look similar to this: [figure: the data matrix rendered as an image]
For this sample data, one can calculate the relative error made by projecting the input data onto ndim dimensions as follows:
[coeff, score] = pca(data, 'Economy', true);
relativeError = zeros(p, 1);
for ndim = 1:p
    % Reconstruct the data from the first ndim principal components
    % (add the column means back, because pca centers the data).
    reconstructed = repmat(mean(data, 1), n, 1) + score(:, 1:ndim)*coeff(:, 1:ndim)';
    residuals = data - reconstructed;
    % Worst-case element-wise relative error (abs added: the sign of a
    % residual is irrelevant for the error magnitude).
    relativeError(ndim) = max(max(abs(residuals./data)));
end
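As an aside, the reconstruction inside the loop can also be obtained from pcares (referenced at the bottom of this answer): its second output is the approximation of the data from the first ndim components. A sketch of the equivalent loop body:

% Equivalent loop body using pcares from the Statistics Toolbox:
[residuals, reconstructed] = pcares(data, ndim);
relativeError(ndim) = max(max(abs(residuals./data)));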
Plotting the relative error as a function of the number of dimensions (principal components) results in the following graph: [figure: maximum relative error versus number of principal components]
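The plot itself is not produced by the code above; a minimal sketch of how to generate it would be:

plot(1:p, relativeError);                         % error curve over all dimensions
xlabel('Number of principal components (ndim)');
ylabel('Maximum relative reconstruction error');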
Based on this graph, you can decide how many principal components you need to take into account. In this theoretical example, taking 100 components results in an exact representation, so taking more than 100 components is useless. If you want at most 5% error, for example, you should take about 40 principal components.
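If you prefer to automate that read-off, the smallest number of components meeting a given error budget (5% here, purely as an example) can be pulled straight from the computed vector; a sketch:

maxError = 0.05;                                  % example error budget
ndimNeeded = find(relativeError <= maxError, 1);  % first ndim within budget
fprintf('Take %d principal components\n', ndimNeeded);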
Disclaimer: the obtained values are only valid for my artificial data. Do not use the proposed values blindly in your own situation; perform the same analysis and make a trade-off between the error you make and the number of components you need.
Code reference: pcares