The code below causes my system to run out of memory before it completes.
Can you suggest a more efficient means of computing the cosine similarity on a large matrix?
You're running out of memory because you're trying to store a 65000x65000 similarity matrix, and the input matrix you're constructing is not sparse at all. np.random.rand generates random floats between 0 and 1, so there aren't enough zeros for csr_matrix to actually compress your data; in fact, there are almost surely no zeros at all.
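A quick way to see this is to convert a small random matrix and check how many entries the CSR format actually stores (an illustrative sketch with a reduced shape; dense is just a throwaway name):

>>> import numpy as np
>>> from scipy import sparse
>>> dense = np.random.rand(1000, 10)  # uniform on [0, 1), so essentially never exactly zero
>>> sparse.csr_matrix(dense).nnz      # every one of the 10000 entries is stored
10000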
If you look closely at your MemoryError traceback, you can see that cosine_similarity tries to use a sparse dot product if possible:
MemoryError                               Traceback (most recent call last)
    887         Y_normalized = normalize(Y, copy=True)
    888 
--> 889     K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
    890 
    891     return K
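Some back-of-the-envelope arithmetic (my own numbers, not part of the traceback) shows why that dot product blows up: with no zeros in the input, the 65000x65000 result has a stored value for every entry, which is tens of gigabytes before counting any CSR index overhead:

>>> 65000 * 65000 * 8 / 1e9  # float64 values in a full 65000x65000 result, in GB
33.8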
So the problem isn't with cosine_similarity, it's with your matrix. Try initializing an actual sparse matrix (with only 1% of the entries nonzero, for example) like this:
>>> import numpy as np
>>> from scipy import sparse
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> a = np.zeros((65000, 10))
>>> i = np.random.rand(a.size)
>>> a.flat[i < 0.01] = 1  # set roughly 1% of the entries to 1
>>> a = sparse.csr_matrix(a)
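(As an aside, scipy.sparse.random can build an equivalent test matrix in one step, without the intermediate dense array; its nonzero values are uniform on (0, 1) rather than exactly 1, which makes no difference to the memory behaviour. The name a_alt is just for illustration.)

>>> a_alt = sparse.random(65000, 10, density=0.01, format='csr')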
Then, on a machine with 32GB RAM (8GB RAM was not enough for me), the following runs with no memory error:
>>> b = cosine_similarity(a)
>>> b
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
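If you also want to avoid materializing the full 65000x65000 dense result, note the dense_output argument passed to safe_sparse_dot in the traceback above: cosine_similarity accepts it too, and with a sparse input and dense_output=False the result stays sparse. A minimal sketch:

>>> b = cosine_similarity(a, dense_output=False)  # sparse in, sparse out
>>> b.shape, sparse.issparse(b)
((65000, 65000), True)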