I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.
I can get the document-term matrix, but I'm not sure how to go about obtaining a word-word co-occurrence matrix.
None of the other answers take a moving window into account, so I wrote my own function that builds the co-occurrence matrix by sliding a window of a defined size over each sentence. The function takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix:
from collections import defaultdict

import numpy as np
import pandas as pd


def co_occurrence(sentences, window_size):
    d = defaultdict(int)  # maps a sorted (token, token) pair to its count
    vocab = set()
    for text in sentences:
        # preprocessing (use tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            # the next `window_size` tokens to the right of the current one
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1

    # formulate the dictionary into dataframe
    vocab = sorted(vocab)  # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        # fill both cells so the matrix is symmetric
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df
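To make the window logic easier to follow, here is a minimal sketch that lists the pairs the inner loops would count for a single, already tokenized sentence. The helper name `window_pairs` is hypothetical and only serves this illustration; it is not part of the function above.

```python
def window_pairs(tokens, window_size):
    """Return the unordered pairs counted for one tokenized sentence."""
    pairs = []
    for i, token in enumerate(tokens):
        # Pair the current token with up to `window_size` tokens to its right.
        for neighbour in tokens[i + 1 : i + 1 + window_size]:
            # Sorting makes (a, b) and (b, a) map to the same key.
            pairs.append(tuple(sorted([neighbour, token])))
    return pairs


pairs = window_pairs("i go to school".split(), 2)
print(pairs)
# [('go', 'i'), ('i', 'to'), ('go', 'to'), ('go', 'school'), ('school', 'to')]
```

Because the window only looks to the right, each pair of positions at most window_size apart is counted exactly once; the symmetric fill at the end of co_occurrence then mirrors the count into both cells.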
Let's try it out given the following two simple sentences:
>>> text = ["I go to school every day by bus .",
...         "i go to theatre every night by bus"]
>>>
>>> df = co_occurrence(text, 2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0

[11 rows x 11 columns]
Now, we have our co-occurrence matrix.
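The function splits on whitespace, and its own comment suggests using a tokenizer instead. As a rough sketch of what that could look like, the regex-based helper below separates punctuation from words, so "bus." yields the same tokens as "bus .". The name `tokenize` and the pattern are assumptions made for illustration, not something the function above depends on; a library tokenizer may be a better fit for real text.

```python
import re

# Hypothetical helper: runs of word characters, or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lowercase `text` and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("I go to school every day by bus."))
# ['i', 'go', 'to', 'school', 'every', 'day', 'by', 'bus', '.']
```

To use it, the line `text = text.lower().split()` in co_occurrence could be swapped for `text = tokenize(text)`. Note that this simple pattern also splits contractions such as "don't" into three tokens.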