Question
I have a sparse matrix with M rows and N columns, to which I want to concatenate K additional null columns, so that my object ends up with M rows and (N+K) columns. The tricky part is that I also have a list of indices of length N, with values ranging from 0 to N+K-1, that indicates the position every column should have in the new matrix.
So for example, if N = 2, K = 1 and the list of indices is [2, 0], it means that I want to take the last column from my MxN matrix to be the first one, then introduce a null column, and then put my first column as the last one.
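To make the mapping concrete, here is a minimal dense sketch of that toy case (the values are just made up for illustration):

import numpy as np
a = np.array([[1, 2],
              [3, 4]])              # N = 2 columns
indices = [2, 0]                    # new position of each old column
out = np.zeros((a.shape[0], 3))     # N + K = 3 columns, all null to start
out[:, indices] = a                 # column 1 -> position 0, column 0 -> position 2
# out is now [[2., 0., 1.],
#             [4., 0., 3.]]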
I'm trying to use the following code (in my real case I already have x, but I can't upload it here):
import numpy as np
from scipy import sparse
M = 5000
N = 10
pad_factor = 1.2
size = int(pad_factor * N)                                       # N + K total columns
x = sparse.random(m=M, n=N, density=0.1, dtype='float64')
indeces = np.random.choice(range(size), size=N, replace=False)   # new position of each column
null_mat = sparse.lil_matrix((M, size))
null_mat[:, indeces] = x
The problem is that for M = 5,000, N = 1,500,000 and K = 200 this code won't scale; it gives me a memory error. The exact error is: "return np.zeros(self.shape, dtype=self.dtype, order=order) MemoryError".
I just want to add some null columns, so I suspect my slicing approach is inefficient, especially as K << N in my real data. In a way this can be thought of as a merge-sort-like problem: I have a non-null and a null dataset and I want to concatenate them, but in a specific order. Any ideas on how to make it work?
Thanks!
Answer 1:
As I deduced in the comments, the memory error was produced in the
null_mat[:, indeces] = x
line, because the lil __setitem__ method does an x.toarray(), that is, it first converts x to a dense array. Mapping the sparse matrix onto the indexed lil directly might be more space efficient, but a lot more work to code. And lil is optimized for iterative assignment, not this kind of large scale matrix mapping.
sparse.hstack uses sparse.bmat to join sparse matrices. This converts all inputs to coo, combines their attributes into a new set, and builds the new matrix from those.
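For reference, a hstack-based variant of the padding (my own sketch using the question's variables, not one of the approaches timed below) would first append the K null columns and then reorder with csr column indexing:

pad = sparse.csr_matrix((M, size - N))            # the K all-null columns
padded = sparse.hstack([x, pad]).tocsr()          # columns: [x columns..., zeros...]
order = np.full(size, N, dtype=int)               # default every position to a zero column
order[indeces] = np.arange(N)                     # position indeces[j] takes x column j
z_h = padded[:, order]                            # reorder into the requested layout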
direct coo matrix construction
After quite a bit of playing around, I found that the following simple operation works:
In [479]: z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))
In [480]: z1
Out[480]:
<5000x12 sparse matrix of type '<class 'numpy.float64'>'
with 5000 stored elements in COOrdinate format>
Compare this with x and null_mat:
In [481]: x
Out[481]:
<5000x10 sparse matrix of type '<class 'numpy.float64'>'
with 5000 stored elements in COOrdinate format>
In [482]: null_mat
Out[482]:
<5000x12 sparse matrix of type '<class 'numpy.float64'>'
with 5000 stored elements in LInked List format>
Testing the equality of sparse matrices can be tricky. coo values in particular can occur in any order, as in x, which was produced by sparse.random. But the csr format orders the rows, so this comparison of the indptr attributes is a pretty good equality test:
In [483]: np.allclose(null_mat.tocsr().indptr, z1.tocsr().indptr)
Out[483]: True
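For a stricter check (my own addition, not part of the original comparison), the element-wise difference of the two matrices can be inspected directly:

assert abs(null_mat.tocsr() - z1.tocsr()).max() == 0   # no element differs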
A time test:
In [477]: timeit z1=sparse.coo_matrix((x.data, (x.row, indeces[x.col])),shape=(M,size))
108 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [478]:
In [478]: timeit null_mat[:, indeces] = x
3.05 ms ± 4.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
matrix multiplication approach
csr format indexing with lists is done with matrix multiplication. It constructs an extractor matrix and applies that. Matrix multiplication is a csr_matrix strong point.
We can perform the reordering in the same way:
In [489]: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)), shape=(10,12))
In [490]: I
Out[490]:
<10x12 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
In [496]: w1=x*I
Comparing the dense equivalents of these matrices:
In [497]: np.allclose(null_mat.A, z1.A)
Out[497]: True
In [498]: np.allclose(null_mat.A, w1.A)
Out[498]: True
In [499]: %%timeit
...: I = sparse.csr_matrix((np.ones(10),(np.arange(10),indeces)),shape=(10,
...: 12))
...: w1=x*I
1.11 ms ± 5.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's better than the lil indexing approach, though still much slower than the direct coo matrix construction. Though to be fair, we should construct a csr matrix from the coo style inputs. That conversion takes some time:
In [502]: timeit z2=sparse.csr_matrix((x.data, (x.row, indeces[x.col])),shape=(M
...: ,size))
639 µs ± 604 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
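In terms of the question's variables (rather than the hard-coded 10 and 12 above), the same extractor construction would look like this sketch:

I_gen = sparse.csr_matrix((np.ones(N), (np.arange(N), indeces)), shape=(N, size))
w_gen = x * I_gen            # column j of x lands in column indeces[j] of the result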
error traceback
The MemoryError traceback should have revealed that the error occurred in this indexed assignment, and that the relevant method calls are:
Signature: null_mat.__setitem__(index, x)
Source:
def __setitem__(self, index, x):
    ....
    if isspmatrix(x):
        x = x.toarray()
    ...
Signature: x.toarray(order=None, out=None)
Source:
def toarray(self, order=None, out=None):
    """See the docstring for `spmatrix.toarray`."""
    B = self._process_toarray_args(order, out)
Signature: x._process_toarray_args(order, out)
Source:
def _process_toarray_args(self, order, out):
    ...
    return np.zeros(self.shape, dtype=self.dtype, order=order)
I found this by doing a code search on the scipy github for the np.zeros calls.
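To see why that np.zeros call blows up at the question's real sizes (assuming M = 5,000 rows and N + K = 1,500,200 columns), a rough back-of-the-envelope estimate:

rows, cols = 5_000, 1_500_200        # sizes from the question
print(rows * cols * 8 / 1e9)         # float64 -> roughly 60 GB for the dense zeros array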
Source: https://stackoverflow.com/questions/49652985/inserting-null-columns-into-a-scipy-sparse-matrix-in-a-specific-order