The code below causes my system to run out of memory before it completes.
Can you suggest a more efficient means of computing the cosine similarity on a large matrix?
I would run it in chunks, like this:
from sklearn.metrics.pairwise import cosine_similarity

# Adjust chunk_size to trade memory for speed:
# a larger chunk_size needs more RAM but finishes faster.
chunk_size = 500
matrix_len = your_matrix.shape[0]  # a dense numpy.ndarray, not a sparse matrix

def similarity_cosine_by_chunk(start, end):
    if end > matrix_len:
        end = matrix_len
    # scikit-learn computes the (end - start) x matrix_len block of similarities
    return cosine_similarity(X=your_matrix[start:end], Y=your_matrix)

for chunk_start in range(0, matrix_len, chunk_size):
    cosine_similarity_chunk = similarity_cosine_by_chunk(chunk_start, chunk_start + chunk_size)
    # Handle cosine_similarity_chunk here, e.g. write it to a timestamped file
    # and close that file before the next iteration. Do not accumulate the
    # chunks in memory, or you will run out of memory after a few chunks.
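
If it helps, here is a self-contained sketch of one way to handle each chunk: writing every block directly into a preallocated memory-mapped output file, so no chunk stays in RAM. The file name similarity.dat and the random input matrix are placeholders for illustration; substitute your own array and path.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder input; replace with your own dense ndarray.
your_matrix = np.random.rand(2000, 300).astype(np.float32)

n = your_matrix.shape[0]
chunk_size = 500

# Preallocate the full n x n result on disk instead of in RAM.
out = np.memmap('similarity.dat', dtype=np.float32, mode='w+', shape=(n, n))

for start in range(0, n, chunk_size):
    end = min(start + chunk_size, n)
    # Each block is only (end - start) x n, so peak memory stays bounded.
    out[start:end] = cosine_similarity(your_matrix[start:end], your_matrix)

out.flush()  # ensure everything is written to disk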