Efficiently populate SciPy sparse matrix from subset of dictionary

Submitted by 此生再无相见时 on 2019-12-24 03:41:58

Question


I need to store word co-occurrence counts in several 14000x10000 matrices. Since I know the matrices will be sparse and I do not have enough RAM to store all of them as dense matrices, I am storing them as scipy.sparse matrices.

I have found the most efficient way to gather the counts to be using Counter objects. Now I need to transfer the counts from the Counter objects to the sparse matrices, but this takes too long. It currently takes on the order of 18 hours to populate the matrices.

The code I'm using is roughly as follows:

for word_ind1 in range(len(wordlist1)):
    for word_ind2 in range(len(wordlist2)):
        word_counts[word_ind2, word_ind1]=word_counters[wordlist1[word_ind1]][wordlist2[word_ind2]]

Here word_counts is a scipy.sparse.lil_matrix object, word_counters is a dictionary of Counter objects, and wordlist1 and wordlist2 are lists of strings.

Is there any way to do this more efficiently?


Answer 1:


You're using LIL matrices, which (unfortunately) have a linear-time insertion algorithm. Therefore, constructing them in this way takes quadratic time. Try a DOK matrix instead; those use hash tables for storage.
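As a rough sketch of the DOK suggestion, the snippet below fills a scipy.sparse.dok_matrix from a dictionary of Counter objects. The toy word lists and counts are made up for illustration, and the `row_index` lookup is a hypothetical helper. Note one deliberate change that is not part of the answer above: the loop visits only the pairs that were actually counted instead of every (wordlist1, wordlist2) combination, which assumes most pairs never co-occur.

```python
from collections import Counter
from scipy.sparse import dok_matrix

# Toy stand-ins for the question's data (hypothetical values).
wordlist1 = ["cat", "dog"]
wordlist2 = ["runs", "sleeps", "eats"]
word_counters = {
    "cat": Counter({"sleeps": 3, "eats": 1, "purrs": 5}),  # "purrs" is not in wordlist2
    "dog": Counter({"runs": 2}),
}

# Map each wordlist2 entry to its row index once, so each lookup is a dict access.
row_index = {word: i for i, word in enumerate(wordlist2)}

# Same orientation as the question: rows follow wordlist2, columns follow wordlist1.
word_counts = dok_matrix((len(wordlist2), len(wordlist1)), dtype=int)

for col, word1 in enumerate(wordlist1):
    # Visit only the stored counts rather than all possible word pairs.
    for word2, count in word_counters.get(word1, {}).items():
        row = row_index.get(word2)
        if row is not None:  # skip words outside the chosen subset
            word_counts[row, col] = count

# Convert once at the end; CSR is better suited to arithmetic than DOK.
word_counts = word_counts.tocsr()
```

Whether this helps in practice depends on how sparse the counters really are; if most pairs do occur, the matrix is not a good fit for sparse storage in the first place.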

However, if you're interested in boolean term occurrences, then computing the co-occurrence matrix is much faster if you have a sparse term-document matrix. Let A be such a matrix of shape (n_documents, n_terms); then the co-occurrence matrix is

A.T * A
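To make the product concrete, here is a minimal sketch on a made-up boolean term-document matrix with 3 documents and 4 terms. It uses `@` rather than `*`: for the scipy.sparse matrix classes both perform matrix multiplication, but `@` also behaves that way for the newer sparse array classes, where `*` is element-wise.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical boolean term-document matrix: rows are documents, columns are terms.
A = csr_matrix(np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
]))

# Entry (i, j) is the number of documents that contain both term i and term j.
cooc = A.T @ A

# Dense copy only for inspection on this tiny example.
dense = cooc.toarray()
```

The diagonal of the result holds each term's document frequency, and the matrix is symmetric, so only one triangle is needed if storage matters.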


Source: https://stackoverflow.com/questions/22796118/efficiently-populate-scipy-sparse-matrix-from-subset-of-dictionary
