Using Principal Components Analysis (PCA) on binary data

Submitted by 不问归期 on 2019-12-04 20:42:32

The principal components of 0/1 data can fall off slowly or rapidly, and so can the PCs of continuous data -- it depends on the data. Can you describe your data?

The following picture is intended to compare the PCs of continuous image data vs. the PCs of the same data quantized to 0/1: in this case, inconclusive.

Look at PCA as a way of getting an approximation to a big matrix,
first with one term: approximate A ≈ c U V^T, i.e. A_ij ≈ c U_i V_j.
Consider this a bit, with A say 10k x 500: U 10k long, V 500 long. The top row is c U1 V, the second row is c U2 V ... all the rows are proportional to V. Similarly the leftmost column is c U V1 ... all the columns are proportional to U.
But if all the rows are similar (proportional to each other), they can't get near an A matrix whose rows or columns look like 0 1 0 0 0 1 0 1 0 1 ...
With more terms, A ≈ c1 U1 V1^T + c2 U2 V2^T + ..., we can get nearer to A: the faster the ci fall off, the fewer terms we need. (Of course, all 500 terms recreate A exactly, to within roundoff error.)
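A minimal numpy sketch of this picture (the random test matrices and the function name `rank_k_approx` are my own, not from the question) -- it builds the k-term approximation c1 U1 V1^T + ... + ck Uk Vk^T and compares how the error falls off for continuous data vs. the same data quantized to 0/1:

```python
import numpy as np

rng = np.random.default_rng(0)

# A continuous matrix of modest rank, and the same data quantized to 0/1
A = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 50))
B = (A > 0).astype(float)  # 0/1 version

def rank_k_approx(M, k):
    """k-term SVD approximation: the sum of the top k dyads c_i U_i V_i^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]

for k in (1, 5, 10):
    errA = np.linalg.norm(A - rank_k_approx(A, k)) / np.linalg.norm(A)
    errB = np.linalg.norm(B - rank_k_approx(B, k)) / np.linalg.norm(B)
    print(f"k={k:2d}  continuous err {errA:.3f}   0/1 err {errB:.3f}")
```

By the Eckart-Young theorem this truncated SVD is the best rank-k approximation in the Frobenius norm, so the printed errors can only decrease as k grows.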

The top row is "lena", a well-known 512 x 512 matrix, with 1-term and 10-term SVD approximations. The bottom row is lena discretized to 0/1, again with 1 term and 10 terms. I thought that the 0/1 lena would be much worse -- comments, anyone?

(U V^T is also written U ⊗ V, and called a "dyad" or "outer product".)
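In numpy the dyad is `np.outer`; a tiny sketch (the vector values are arbitrary, just for illustration):

```python
import numpy as np

U = np.array([1.0, 2.0, 3.0])
V = np.array([4.0, 5.0])

# The dyad U ⊗ V = U V^T: entry (i, j) is U[i] * V[j],
# so every row is proportional to V and every column to U.
D = np.outer(U, V)
print(D)
print(np.linalg.matrix_rank(D))  # a dyad of nonzero vectors has rank 1
```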

(The Wikipedia articles Singular value decomposition and Low-rank approximation are a bit math-heavy. An AMS column by David Austin, "We Recommend a Singular Value Decomposition", gives some intuition on SVD / PCA -- highly recommended.)
