Question
I have a sparse matrix with M rows and N columns, to which I want to concatenate K additional null columns, so that my object ends up with M rows and (N+K) columns. The tricky part is that I also have a list of indices of length N, with values ranging from 0 to N+K-1, that indicates the position every column should have in the new matrix.
So for example, if N = 2, K = 1 and the list of indices is [2, 0], it means that I want to take the last column from my MxN matrix to be the first one, then introduce a null column, and then put my first column as the last one.
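To make the mapping concrete, here is a minimal dense sketch of that toy case (the values are just made up for illustration):

import numpy as np
a = np.array([[1, 2],
              [3, 4]])              # N = 2 columns
indices = [2, 0]                    # new position of each old column
out = np.zeros((a.shape[0], 3))     # N + K = 3 columns, all null to start
out[:, indices] = a                 # column 1 -> position 0, column 0 -> position 2
# out is now [[2., 0., 1.],
#             [4., 0., 3.]]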
I'm trying to use the following code (in my real case I already have x, but I can't upload it here):
import numpy as np
from scipy import sparse
M = 5000
N = 10
pad_factor = 1.2
size = int(pad_factor * N)                                       # N + K total columns
x = sparse.random(m=M, n=N, density=0.1, dtype='float64')
indeces = np.random.choice(range(size), size=N, replace=False)   # new position of each column
null_mat = sparse.lil_matrix((M, size))
null_mat[:, indeces] = x
The problem is that for M = 5,000, N = 1,500,000 and K = 200 this code won't scale; it gives me a memory error. The exact error is: "return np.zeros(self.shape, dtype=self.dtype, order=order) MemoryError".
I just want to add some null columns, so I suspect my slicing approach is inefficient, especially as K << N in my real data. In a way this can be thought of as a merge-sort-like problem: I have a non-null and a null dataset and I want to concatenate them, but in a specific order. Any ideas on how to make it work?
Thanks!
Answer 1:
As I deduced in the comments, the memory error was produced in the
null_mat[:, indeces] = x
line, because the lil __setitem__ method does an x.toarray(), that is, it first converts x to a dense array. Mapping the sparse matrix onto the indexed lil directly might be more space efficient, but a lot more work to code. And lil is optimized for iterative assignment, not this kind of large scale matrix mapping.
sparse.hstack uses sparse.bmat to join sparse matrices. This converts all inputs to coo, combines their attributes into a new set, and builds the new matrix from those.
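For reference, a hstack-based variant of the padding (my own sketch using the question's variables, not one of the approaches timed below) would first append the K null columns and then reorder with csr column indexing:

pad = sparse.csr_matrix((M, size - N))            # the K all-null columns
padded = sparse.hstack([x, pad]).tocsr()          # columns: [x columns..., zeros...]
order = np.full(size, N, dtype=int)               # default every position to a zero column
order[indeces] = np.arange(N)                     # position indeces[j] takes x column j
z_h = padded[:, order]                            # reorder into the requested layout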
direct coo matrix construction
After quite a bit of playing around, I found that the following simple operation works:
In [479]: z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))
In [480]: z1
Out[480]:
<5000x12 sparse matrix of type '<class 'numpy.float64'>'
with 5000 stored elements in COOrdinate format>
Compare this with x and null_mat:
In [481]: x
Out[481]:
<5000x10 sparse matrix of type '<class 'numpy.float64'>'
with 5000 stored elements in COOrdinate format>
In [482]: null_mat
Out[482]:
<5000x12 sparse matrix of type '<class 'numpy.float64'>'
with 5000 stored elements in LInked List format>
Testing the equality of sparse matrices can be tricky. coo values in particular can occur in any order, as in x, which was produced by sparse.random. But the csr format orders the rows, so this comparison of the indptr attributes is a pretty good equality test:
In [483]: np.allclose(null_mat.tocsr().indptr, z1.tocsr().indptr)
Out[483]: True
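For a stricter check (my own addition, not part of the original comparison), the element-wise difference of the two matrices can be inspected directly:

assert abs(null_mat.tocsr() - z1.tocsr()).max() == 0   # no element differs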
A time test:
In [477]: timeit z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))
108 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [478]:
In [478]: timeit null_mat[:, indeces] = x
3.05 ms ± 4.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
matrix multiplication approach
csr format indexing with lists is done with matrix multiplication. It constructs an extractor matrix and applies that. Matrix multiplication is a csr_matrix strong point.
We can perform the reordering in the same way:
In [489]: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)), shape=(10,12))
In [490]: I
Out[490]:
<10x12 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
In [496]: w1=x*I
Comparing the dense equivalents of these matrices:
In [497]: np.allclose(null_mat.A, z1.A)
Out[497]: True
In [498]: np.allclose(null_mat.A, w1.A)
Out[498]: True
In [499]: %%timeit
...: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)),shape=(10,
...: 12))
...: w1=x*I
1.11 ms ± 5.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's better than the lil indexing approach, though still much slower than the direct coo matrix construction. Though to be fair, we should construct a csr matrix from the coo style inputs. That conversion takes some time:
In [502]: timeit z2=sparse.csr_matrix((x.data, (x.row, indeces[x.col])),shape=(M
...: ,size))
639 µs ± 604 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
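In terms of the question's variables (rather than the hard-coded 10 and 12 above), the same extractor construction would look like this sketch:

I_gen = sparse.csr_matrix((np.ones(N), (np.arange(N), indeces)), shape=(N, size))
w_gen = x * I_gen            # column j of x lands in column indeces[j] of the result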
error traceback
The MemoryError traceback should have revealed that the error occurred in this indexed assignment, and that the relevant method calls are:
Signature: null_mat.__setitem__(index, x)
Source:
def __setitem__(self, index, x):
    ....
    if isspmatrix(x):
        x = x.toarray()
    ...
Signature: x.toarray(order=None, out=None)
Source:
def toarray(self, order=None, out=None):
    """See the docstring for `spmatrix.toarray`."""
    B = self._process_toarray_args(order, out)
Signature: x._process_toarray_args(order, out)
Source:
def _process_toarray_args(self, order, out):
    ...
    return np.zeros(self.shape, dtype=self.dtype, order=order)
I found this by doing a code search on the scipy github for the np.zeros calls.
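To see why that np.zeros call blows up at the question's real sizes (assuming M = 5,000 rows and N + K = 1,500,200 columns), a rough back-of-the-envelope estimate:

rows, cols = 5_000, 1_500_200        # sizes from the question
print(rows * cols * 8 / 1e9)         # float64 -> roughly 60 GB for the dense zeros array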
Source: https://stackoverflow.com/questions/49652985/inserting-null-columns-into-a-scipy-sparse-matrix-in-a-specific-order