cosine similarity on large sparse matrix with numpy

逝去的感伤 2020-12-16 20:18

The code below causes my system to run out of memory before it completes.

Can you suggest a more efficient means of computing the cosine similarity on a large matrix?
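
The setup was roughly the following; this is a sketch reconstructed from the details in the answer below (a 65000x10 np.random.rand array wrapped in csr_matrix and passed to cosine_similarity), so the exact original code may have differed:

    >>> import numpy as np
    >>> from scipy import sparse
    >>> from sklearn.metrics.pairwise import cosine_similarity
    >>> a = np.random.rand(65000, 10)   # 65000 samples, 10 features
    >>> a = sparse.csr_matrix(a)
    >>> sim = cosine_similarity(a)      # raises MemoryError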

3 Answers
  •  伪装坚强ぢ
    2020-12-16 20:27

    You're running out of memory because you're trying to store a 65000x65000 matrix. Note that the matrix you're constructing is not sparse at all. np.random.rand draws values uniformly from [0, 1), so there aren't enough zeros for csr_matrix to actually compress your data. In fact, there are almost surely no zeros at all.

    If you look closely at your MemoryError traceback, you can see that cosine_similarity tries to use the sparse dot product if possible:

    MemoryError                  Traceback (most recent call last)
        887         Y_normalized = normalize(Y, copy=True)
        888 
    --> 889     K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
        890 
        891     return K
    

    So the problem isn't with cosine_similarity, it's with your matrix. Try initializing an actual sparse matrix (about 1% density, for example) like this:

    >>> import numpy as np
    >>> from scipy import sparse
    >>> a = np.zeros((65000, 10))
    >>> i = np.random.rand(a.size)
    >>> a.flat[i < 0.01] = 1        # select roughly 1% of the entries and set them to 1
    >>> a = sparse.csr_matrix(a)
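
    As an aside, scipy can build such a random sparse matrix directly, without allocating the dense intermediate first. A minimal sketch (by default, sparse.random fills the nonzero entries with uniform random values):

    >>> from scipy import sparse
    >>> a = sparse.random(65000, 10, density=0.01, format='csr')  # ~1% of entries nonzero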
    

    Then, on a machine with 32GB RAM (8GB RAM was not enough for me), the following runs with no memory error:

    >>> from sklearn.metrics.pairwise import cosine_similarity
    >>> b = cosine_similarity(a)
    >>> b
    array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           ..., 
           [ 0.,  0.,  0., ...,  1.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
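
    Even with sparse input, note that cosine_similarity returns a dense ndarray by default, so the full 65000x65000 result above is still materialized in memory. If that output itself is too large for your machine, sklearn can keep the result sparse instead; a minimal sketch, reusing a from above:

    >>> from sklearn.metrics.pairwise import cosine_similarity
    >>> b = cosine_similarity(a, dense_output=False)  # output stays a scipy sparse matrix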
    
