cosine similarity on large sparse matrix with numpy

逝去的感伤 2020-12-16 20:18

The code below causes my system to run out of memory before it completes.

Can you suggest a more efficient means of computing the cosine similarity on a large matrix?

3 Answers
  •  我在风中等你
    2020-12-16 20:26

    I would run it in chunks, like this:

    from sklearn.metrics.pairwise import cosine_similarity
    
    # Change chunk_size to control the memory/speed trade-off:
    # a larger chunk_size needs more RAM but runs faster.
    chunk_size = 500
    matrix_len = your_matrix.shape[0]  # number of rows (sparse or dense input)
    
    def similarity_cosine_by_chunk(start, end):
        if end > matrix_len:
            end = matrix_len
        # scikit-learn's cosine_similarity accepts sparse input directly
        return cosine_similarity(X=your_matrix[start:end], Y=your_matrix)
    
    for chunk_start in range(0, matrix_len, chunk_size):  # xrange on Python 2
        cosine_similarity_chunk = similarity_cosine_by_chunk(chunk_start, chunk_start + chunk_size)
        # Handle cosine_similarity_chunk here, e.g. write it out to a file and
        # close that file before the next iteration. Do not keep all chunks
        # (or one ever-growing buffer) in memory, or you will run out of RAM
        # again after a few chunks.
    
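    To make the memory bound explicit, here is one way to flesh out the
    "handle each chunk" step: instead of storing the full n×n similarity
    matrix, keep only the top-k most similar rows for each row. This is a
    sketch, not the asker's exact setup; `top_k_cosine_by_chunk` and the
    random sparse test matrix are illustrative names/data of my own.

    ```python
    import numpy as np
    import scipy.sparse as sp
    from sklearn.metrics.pairwise import cosine_similarity

    def top_k_cosine_by_chunk(X, k=5, chunk_size=500):
        """For each row of X, return the indices and scores of its k most
        similar other rows, computing one dense (chunk x n) block at a time."""
        n = X.shape[0]
        top_idx = np.empty((n, k), dtype=np.int64)
        top_val = np.empty((n, k), dtype=np.float64)
        for start in range(0, n, chunk_size):
            end = min(start + chunk_size, n)
            # Only this (end - start, n) block is ever dense in memory.
            sims = cosine_similarity(X[start:end], X)
            # Mask out each row's similarity to itself.
            sims[np.arange(end - start), np.arange(start, end)] = -np.inf
            # Unsorted top-k per row, then sort those k descending.
            part = np.argpartition(sims, -k, axis=1)[:, -k:]
            vals = np.take_along_axis(sims, part, axis=1)
            order = np.argsort(-vals, axis=1)
            top_idx[start:end] = np.take_along_axis(part, order, axis=1)
            top_val[start:end] = np.take_along_axis(vals, order, axis=1)
        return top_idx, top_val

    # Small synthetic example on a random sparse CSR matrix.
    X = sp.random(1000, 50, density=0.05, format="csr", random_state=0)
    idx, val = top_k_cosine_by_chunk(X, k=3, chunk_size=200)
    ```

    With this pattern peak memory is roughly `chunk_size * n` floats for the
    current block plus `n * k` for the results, instead of `n * n` for the
    full similarity matrix.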
