How to convert co-occurrence matrix to sparse matrix

爷,独闯天下 提交于 2019-12-05 22:18:21

Here's how you construct a document-term matrix A from a set of documents in SciPy's COO format, which is a good tradeoff between ease of use and efficiency(*):

vocabulary = {}  # map terms to column indices
data = []        # values (maybe weights)
row = []         # row (document) indices
col = []         # column (term) indices

for i, doc in enumerate(documents):
    for term in doc:
        # get column index, adding the term to the vocabulary if needed
        j = vocabulary.setdefault(term, len(vocabulary))
        data.append(1)  # uniform weights
        row.append(i)
        col.append(j)

A = scipy.sparse.coo_matrix((data, (row, col)))

Now, to get a cooccurrence matrix:

A.T * A

(ignore the diagonal, which holds cooccurrences of term with themselves, i.e. squared frequency).

Alternatively, use some package that does this kind of thing for you, such as Gensim or scikit-learn. (I'm a contributor to both projects, so this might not be unbiased advice.)

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!