Question
I am using PCA on binary attributes to reduce the dimensionality of my problem. The initial number of dimensions was 592, and after PCA it is 497. I have used PCA before, on numeric attributes in another problem, and there it reduced the dimensions to a much greater extent (to about half the initial number). I believe that binary attributes weaken PCA, but I do not know why. Could you please explain why PCA does not work as well on binary data as it does on numeric data?
Thank you.
Answer 1:
The principal components of 0/1 data can fall off slowly or rapidly, and so can the PCs of continuous data; it depends on the data. Can you describe your data?
[Figure from the original answer: PCs of continuous image data vs. PCs of the same data quantized to 0/1; in this case, inconclusive.]
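For a rough, self-contained version of that comparison (a minimal sketch; the random matrix `img` below is only a stand-in for real image data), you can look at how many leading components are needed before and after quantizing to 0/1:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for "continuous image data": a product of random factors,
# so the columns are correlated and the spectrum falls off.
img = rng.random((200, 40)) @ rng.random((40, 100))

# The same data quantized to 0/1 at its median.
binary = (img > np.median(img)).astype(float)

for name, A in (("continuous", img), ("0/1 quantized", binary)):
    A = A - A.mean(axis=0)                  # center columns, as PCA does
    s = np.linalg.svd(A, compute_uv=False)  # singular values, descending
    var = s**2 / np.sum(s**2)               # variance fraction per PC
    k = int(np.searchsorted(np.cumsum(var), 0.95)) + 1
    print(f"{name}: {k} PCs for 95% of the variance")
```

Whether quantization spreads the variance over more components, as it typically does here, depends on the data, which is the point of the answer.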
Look at PCA as a way of getting an approximation to a big matrix,
first with one term: approximate A ≈ c U Vᵀ, the matrix with entries c Uᵢ Vⱼ.
Consider this a bit, with A say 10k x 500: U 10k long, V 500 long.
The top row is c U₁ V, the second row is c U₂ V, ...
all the rows are proportional to V.
Similarly, the leftmost column is c U V₁, ...
all the columns are proportional to U.
But if all rows are similar (proportional to each other),
they can't get near an A matrix with rows or columns like 0100010101 ...
With more terms, A ≈ c₁ U₁ V₁ᵀ + c₂ U₂ V₂ᵀ + ...,
we can get nearer to A: the faster the cᵢ fall off, the fewer terms we need.
(Of course, all 500 terms recreate A exactly, to within roundoff error.)
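Here is a minimal numpy sketch of that sum of dyads (shapes shrunk from the 10k x 500 above so it runs quickly; `k` is the number of terms kept):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((1000, 50))              # small stand-in for 10k x 500

# Full SVD: A = sum_i s_i U[:, i] Vt[i, :]  -- the c_i U_i V_i^T above
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in (1, 5, 25, 50):
    Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]  # sum of the first k dyads
    err = np.linalg.norm(A - Ak) / np.linalg.norm(A)
    print(f"k = {k:2d}: relative error {err:.4f}")
# With all 50 terms the error is roundoff-level, as noted above.
```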

(U Vᵀ is also written U ⊗ V, called a "dyad" or "outer product".)
(The Wikipedia articles Singular value decomposition and Low-rank approximation are a bit math-heavy. An AMS column by David Austin, "We Recommend a Singular Value Decomposition", gives some intuition on SVD / PCA; highly recommended.)
Source: https://stackoverflow.com/questions/13505296/using-principal-components-analysis-pca-on-binary-data