Incremental PCA on big data

Submitted by 限于喜欢 on 2019-11-27 14:57:59

Question


I just tried using IncrementalPCA from sklearn.decomposition, but it threw a MemoryError, just like PCA and RandomizedPCA did before. My problem is that the matrix I am trying to load is too big to fit into RAM. Right now it is stored in an hdf5 database as a dataset of shape ~(1000000, 1000), so I have 1,000,000,000 float32 values. I thought IncrementalPCA loads the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be used? Is the hdf5 format the problem?

from sklearn.decomposition import IncrementalPCA
import h5py

db = h5py.File("db.h5","r")
data = db["data"]
IncrementalPCA(n_components=10, batch_size=1).fit(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
    X = check_array(X, dtype=np.float)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
    array = np.atleast_2d(array)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
    ary = asanyarray(ary)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
    arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError

Thanks for your help.


Answer 1:


Your program is probably failing because it tries to load the entire dataset into RAM. 4 bytes per float32 × 1,000,000 × 1,000 is about 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try creating an array of that size on its own:

>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)

If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.
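
The size can also be estimated without allocating anything. The sketch below is plain arithmetic; the helper name array_gib is made up for illustration. It also prints the float64 case, because the traceback in the question passes through check_array(X, dtype=np.float), which suggests scikit-learn would convert the data to 64-bit floats and roughly double the footprint.

```python
def array_gib(rows, cols, bytes_per_value):
    """Size of a dense rows x cols array in GiB (1 GiB = 2**30 bytes)."""
    return rows * cols * bytes_per_value / 2**30

float32_gib = array_gib(1000000, 1000, 4)  # float32: 4 bytes per value
float64_gib = array_gib(1000000, 1000, 8)  # float64: 8 bytes per value
print("float32: %.2f GiB, float64: %.2f GiB" % (float32_gib, float64_gib))
```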

With h5py datasets, we should simply avoid passing the entire dataset to our methods and pass slices of the dataset instead, one at a time.
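
A minimal sketch of the slice-at-a-time idea, in plain Python so that it runs without the data file. The helper name row_slices is hypothetical. It yields slice objects that can index an h5py dataset or a NumPy array, and the last slice is simply shorter when the chunk size does not divide the number of rows.

```python
def row_slices(n_rows, chunk_size):
    """Yield consecutive slices that together cover rows 0 .. n_rows-1."""
    for start in range(0, n_rows, chunk_size):
        yield slice(start, min(start + chunk_size, n_rows))

# A list stands in for the on-disk dataset here; indexing an h5py dataset
# with the same slices would read only those rows from the file.
rows = list(range(10))
chunks = [rows[s] for s in row_slices(len(rows), 4)]
print(chunks)
```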

As I don't have your data, let me start by creating a random dataset of the same size:

import h5py
import numpy as np

# Write the dataset 1000 rows at a time so the full matrix never sits in RAM.
with h5py.File('rand-1Mx1K.h5', 'w') as h5:
    dset = h5.create_dataset('data', shape=(1000000, 1000), dtype=np.float32)
    for i in range(1000):
        dset[i*1000:(i+1)*1000] = np.random.rand(1000, 1000)

It creates a nice 3.8 GiB file.

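Now, if we are on Linux, we can limit how much memory is available to our program. (The bash manual notes that many systems do not honor the -m limit; if it seems to have no effect, ulimit -v, which caps virtual memory, is an alternative.)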

$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152

Now if we try to run your code, we'll get the MemoryError. (Press Ctrl-D to quit the new bash session and reset the limit afterwards.)

Let's try to solve the problem. We'll create an IncrementalPCA object, and will call its .partial_fit() method many times, providing a different slice of the dataset each time.

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data'] # it's ok, the dataset is not fetched to memory yet

n = data.shape[0] # how many rows we have in the dataset
chunk_size = 1000 # how many rows we feed to IPCA at a time; must divide n
# batch_size is only used by fit(); partial_fit() takes each chunk as given
ipca = IncrementalPCA(n_components=10, batch_size=16)

for i in range(0, n//chunk_size):
    # slicing reads just these rows from disk into a NumPy array
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])

It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
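
Once the model is fitted, the data still has to be projected, and that step can be chunked in the same way. The following is a sketch under assumptions, not part of the original answer: the function name fit_and_transform_in_chunks is made up, a small random NumPy array stands in for the h5py dataset, and a trailing chunk with fewer rows than n_components is skipped during fitting because partial_fit can reject such a batch.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def fit_and_transform_in_chunks(data, n_components, chunk_size):
    """Fit IncrementalPCA chunk by chunk, then project chunk by chunk.

    `data` only needs a `.shape` and row slicing, so an h5py dataset
    or a NumPy array both work.
    """
    n_rows = data.shape[0]
    ipca = IncrementalPCA(n_components=n_components)

    # Pass 1: update the model with one chunk at a time.
    for start in range(0, n_rows, chunk_size):
        chunk = data[start:start + chunk_size]
        if chunk.shape[0] < n_components:
            continue  # assumption: skip a tail too short for partial_fit
        ipca.partial_fit(chunk)

    # Pass 2: project every row; the result has only n_components columns.
    reduced = np.empty((n_rows, n_components), dtype=np.float64)
    for start in range(0, n_rows, chunk_size):
        chunk = data[start:start + chunk_size]
        reduced[start:start + chunk.shape[0]] = ipca.transform(chunk)
    return ipca, reduced

# Small stand-in for the real dataset: 500 rows, 20 columns.
rng = np.random.RandomState(0)
demo = rng.rand(500, 20).astype(np.float32)
ipca, reduced = fit_and_transform_in_chunks(demo, n_components=5, chunk_size=100)
print(reduced.shape, ipca.n_samples_seen_)
```

With the real file, data would be h5['data'] and the reduced matrix for 10 components would hold 1,000,000 × 10 float64 values, about 80 MB, which should fit in memory even where the full matrix does not.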



Source: https://stackoverflow.com/questions/31428581/incremental-pca-on-big-data
