The code below causes my system to run out of memory before it completes.
Can you suggest a more efficient means of computing the cosine similarity on a large matrix?
You're running out of memory because you're trying to store a 65000x65000 similarity matrix, and the input matrix you're constructing is not sparse at all. np.random.rand generates random floats between 0 and 1, so there aren't enough zeros for csr_matrix to actually compress your data; in fact, there are almost surely no zeros at all.
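A quick way to see this is to convert a small random matrix and check how many entries the CSR format actually stores (an illustrative sketch with a reduced shape; dense is just a throwaway name):

>>> import numpy as np
>>> from scipy import sparse
>>> dense = np.random.rand(1000, 10)  # uniform on [0, 1), so essentially never exactly zero
>>> sparse.csr_matrix(dense).nnz      # every one of the 10000 entries is stored
10000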
If you look closely at your MemoryError traceback, you can see that cosine_similarity tries to use a sparse dot product if possible:
MemoryError                               Traceback (most recent call last)
    887         Y_normalized = normalize(Y, copy=True)
    888 
--> 889     K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
    890 
    891     return K
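Some back-of-the-envelope arithmetic (my own numbers, not part of the traceback) shows why that dot product blows up: with no zeros in the input, the 65000x65000 result has a stored value for every entry, which is tens of gigabytes before counting any CSR index overhead:

>>> 65000 * 65000 * 8 / 1e9  # float64 values in a full 65000x65000 result, in GB
33.8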
So the problem isn't with cosine_similarity, it's with your matrix. Try initializing an actual sparse matrix (with only 1% of the entries nonzero, for example) like this:
>>> import numpy as np
>>> from scipy import sparse
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> a = np.zeros((65000, 10))
>>> i = np.random.rand(a.size)
>>> a.flat[i < 0.01] = 1  # set roughly 1% of the entries to 1
>>> a = sparse.csr_matrix(a)
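(As an aside, scipy.sparse.random can build an equivalent test matrix in one step, without the intermediate dense array; its nonzero values are uniform on (0, 1) rather than exactly 1, which makes no difference to the memory behaviour. The name a_alt is just for illustration.)

>>> a_alt = sparse.random(65000, 10, density=0.01, format='csr')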
Then, on a machine with 32GB RAM (8GB RAM was not enough for me), the following runs with no memory error:
>>> b = cosine_similarity(a)
>>> b
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
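If you also want to avoid materializing the full 65000x65000 dense result, note the dense_output argument passed to safe_sparse_dot in the traceback above: cosine_similarity accepts it too, and with a sparse input and dense_output=False the result stays sparse. A minimal sketch:

>>> b = cosine_similarity(a, dense_output=False)  # sparse in, sparse out
>>> b.shape, sparse.issparse(b)
((65000, 65000), True)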