Performing PCA on large sparse matrix by using sklearn

Submitted by 微笑、不失礼 on 2019-11-30 03:40:49

Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:

>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp

Create a random sparse matrix with 0.01% of its entries non-zero:

>>> X = sp.rand(1000, 1000, density=0.0001)

Apply PCA to it:

>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)

Now, check the results:

>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> import numpy as np
>>> np.count_nonzero(Xpca), Xpca.size
(95000, 100000)

which suggests that 95000 of the entries are non-zero, however,

>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
(99481, 100000)

99481 elements are close to 0 (< 1e-15), but not exactly 0.

Which means, in short, that for PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example 1e8 x 1617899) dense matrix, which, of course, cannot be held in memory.

I'm not an expert statistician, but I believe there is currently no workaround for this in scikit-learn. It is not a problem with scikit-learn's implementation; it is simply the mathematical nature of their sparse PCA (by means of sparse SVD) that makes the result dense.

The only workaround that might work for you is to start with a small number of components and increase it until you reach a balance between the memory you can afford and the fraction of the variance explained (which you can compute as follows):

>>> clf.explained_variance_ratio_.sum()
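That search can be sketched as a simple loop; the toy matrix, the 0.5 target, and the doubling step below are arbitrary choices for illustration, not part of the answer above:

```python
import numpy as np
from scipy import sparse as sp
from sklearn.decomposition import TruncatedSVD

X = sp.rand(500, 500, density=0.01, random_state=0)

# Start small and double the number of components until the explained
# variance reaches the target (or we run out of dimensions).
n_components, target = 10, 0.5
while True:
    clf = TruncatedSVD(n_components, random_state=0)
    clf.fit(X)
    if clf.explained_variance_ratio_.sum() >= target or 2 * n_components >= min(X.shape):
        break
    n_components *= 2

print(n_components, clf.explained_variance_ratio_.sum())
```

Each iteration refits from scratch, so this trades compute time for a memory footprint you control.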

PCA(X) is SVD(X - mean(X)). Even if X is a sparse matrix, X - mean(X) is always a dense matrix, so the randomized SVD (TruncatedSVD) cannot be applied as efficiently as the randomized SVD of a sparse matrix. However, delayed evaluation,

delay(X-mean(X))

can avoid expanding the sparse matrix X into the dense matrix X - mean(X). Delayed evaluation enables an efficient PCA of a sparse matrix using the randomized SVD.

This mechanism is implemented in my package: https://github.com/niitsuma/delayedsparse/

You can see the code of the PCA using this mechanism here: https://github.com/niitsuma/delayedsparse/blob/master/delayedsparse/pca.py

Performance comparisons with existing methods show that this mechanism drastically reduces the required memory: https://github.com/niitsuma/delayedsparse/blob/master/demo-pca.sh

A more detailed description of this technique can be found in my patent: https://patentscope2.wipo.int/search/ja/detail.jsf?docId=JP225380312
