I have a list of names like:
names = ['A', 'B', 'C', 'D']
and a list of documents, that in each documents some of these names are m
You can also use a matrix trick to compute the co-occurrence matrix. This should scale well to a larger vocabulary, because the document-by-name matrix is stored in sparse form.
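import scipy.sparse as sp
# Map each name to a column index; `document` is the list of documents from the question.
voc2id = dict(zip(names, range(len(names))))
rows, cols, vals = [], [], []
for r, d in enumerate(document):
    for e in d:
        if voc2id.get(e) is not None:
            rows.append(r)
            cols.append(voc2id[e])
            vals.append(1)
# Pass the shape explicitly so names that never occur still get a column.
X = sp.csr_matrix((vals, (rows, cols)), shape=(len(document), len(names)))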
Now you can compute the co-occurrence matrix by multiplying X.T with X:
Xc = X.T @ X  # co-occurrence matrix (names x names)
Xc.setdiag(0)  # zero the diagonal: a name co-occurring with itself is not of interest
print(Xc.toarray())
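To check the approach end to end, here is a minimal self-contained sketch. The contents of `document` are made up for illustration (the question does not give them), and two details are my own assumptions rather than part of the answer above: each name is counted at most once per document via `set(d)`, and the diagonal is cleared on the dense array with `np.fill_diagonal`.

```python
import numpy as np
import scipy.sparse as sp

names = ['A', 'B', 'C', 'D']
# Hypothetical sample data: each document is a list of the names it mentions.
# 'K' and 'Z' are not in `names` and are skipped.
document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

voc2id = {name: i for i, name in enumerate(names)}

rows, cols = [], []
for r, d in enumerate(document):
    # Assumption: a name repeated inside one document counts only once.
    for e in set(d):
        if e in voc2id:
            rows.append(r)
            cols.append(voc2id[e])

# Binary document-by-name matrix: X[r, c] = 1 if document r mentions name c.
data = np.ones(len(rows), dtype=np.int64)
X = sp.csr_matrix((data, (rows, cols)), shape=(len(document), len(names)))

# (X.T @ X)[i, j] = number of documents that mention both name i and name j.
Xc = (X.T @ X).toarray()
np.fill_diagonal(Xc, 0)  # drop self co-occurrences
print(Xc)
```

With this sample data, entry (A, B) is 2 because both names appear together in the first and third documents. For a large vocabulary you would probably keep `Xc` sparse instead of calling `toarray()`.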