Python MemoryError when doing fitting with Scikit-learn

后端 未结 2 637
天命终不由人
天命终不由人 2020-12-15 07:45

I am running Python 2.7 (64-bit) on a Windows 8 64-bit system with 24GB memory. When doing the fitting of the usual Sklearn.linear_models.Ridge, the code runs f

2条回答
  •  暖寄归人
    2020-12-15 08:36

    Take a look at this part of your stack trace:

        531     def _pre_compute_svd(self, X, y):
        532         if sparse.issparse(X) and hasattr(X, 'toarray'):
    --> 533             X = X.toarray()
        534         U, s, _ = np.linalg.svd(X, full_matrices=0)
        535         v = s ** 2
    

    The algorithm you're using relies on numpy's linear algebra routines to do SVD. But those can't handle sparse matrices, so the author simply converts them to regular non-sparse arrays. The first thing that has to happen for this is to allocate an all-zero array and then fill in the appropriate spots with the values sparsely stored in the sparse matrix. Sounds easy enough, but let's math. A float64 (the default dtype, which you're probably using if you don't know what you're using) element takes 8 bytes. So, based on the array shape you've provided, the new zero-filled array will be:

    183576 * 101507 * 8 = 149,073,992,256 ~= 150 gigabytes
    

    Your system's memory manager probably took one look at that allocation request and committed suicide. But what can you do about it?

    First off, that looks like a fairly ridiculous number of features. I don't know anything about your problem domain or what your features are, but my gut reaction is that you need to do some dimensionality reduction here.

    Second, you can try to fix the algorithm's mishandling of sparse matrices. It's choking on numpy.linalg.svd here, so you might be able to use scipy.sparse.linalg.svds instead. I don't know the algorithm in question, but it might not be amenable to sparse matrices. Even if you use the appropriate sparse linear algebra routines, it might produce (or internally use) some non-sparse matrices with sizes similar to your data. Using a sparse matrix representation to represent non-sparse data will only result in using more space than you would have originally, so this approach might not work. Proceed with caution.

提交回复
热议问题