I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.
I can get the document-term matrix, but I am not sure how to go about obtaining a word-word co-occurrence matrix.
You can use the ngram_range parameter of CountVectorizer or TfidfVectorizer.
Code example:
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))  # (2, 2) means only bigrams (pairs of adjacent words) are extracted
If you want to count only specific word pairs, pass them explicitly through the vocabulary parameter, e.g. vocabulary = {'awesome unicorns':0, 'batman forever':1}
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
Here is self-explanatory, ready-to-use code with predefined word-word co-occurrences. In this case we are counting occurrences of "awesome unicorns" and "batman forever":
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

samples = ['awesome unicorns are awesome', 'batman forever and ever', 'I love batman forever']
# Fixed vocabulary: column 0 counts "awesome unicorns", column 1 counts "batman forever"
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary={'awesome unicorns': 0, 'batman forever': 1})
co_occurrences = bigram_vectorizer.fit_transform(samples)
print('Printing sparse matrix:', co_occurrences)
print('Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.toarray())
# Sum each column over all documents and flatten to a 1-D array
sum_occ = np.asarray(co_occurrences.sum(axis=0)).ravel()
print('Sum of word-word occurrences:', sum_occ)
# get_feature_names_out() replaces get_feature_names() in recent scikit-learn versions
print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names_out(), sum_occ.tolist())))
The final output is ('awesome unicorns', 1), ('batman forever', 2), which matches the sample data: "awesome unicorns" appears in one sample and "batman forever" in two.