How do I calculate a word-word co-occurrence matrix with sklearn?

后端未结

关注

 6  996

南旧 2020-12-01 03:06

I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.

I can get the document-term matrix but not sure how to go about obtain

6条回答

我在风中等你 (楼主)

2020-12-01 03:44

You can use the ngram_range parameter in the CountVectorizer or TfidfVectorizer

Code example:

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2)) # by saying 2,2 you are telling you only want pairs of 2 words

In case you want to explicitly say which co-occurrences of words you want to count, use the vocabulary param, i.e: vocabulary = {'awesome unicorns':0, 'batman forever':1}

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Self-explanatory and ready to use code with predefined word-word co-occurrences. In this case we are tracking for co-occurrences of awesome unicorns and batman forever:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
samples = ['awesome unicorns are awesome','batman forever and ever','I love batman forever']
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary = {'awesome unicorns':0, 'batman forever':1}) 
co_occurrences = bigram_vectorizer.fit_transform(samples)
print 'Printing sparse matrix:', co_occurrences
print 'Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.todense()
sum_occ = np.sum(co_occurrences.todense(),axis=0)
print 'Sum of word-word occurrences:', sum_occ
print 'Pretty printig of co_occurrences count:', zip(bigram_vectorizer.get_feature_names(),np.array(sum_occ)[0].tolist())

Final output is ('awesome unicorns', 1), ('batman forever', 2), which corresponds exactly to our samples provided data.

0 讨论(0)

查看其它6个回答