How do I calculate a word-word co-occurrence matrix with sklearn?

南旧 2020-12-01 03:06

I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.

I can get the document-term matrix, but I am not sure how to go about obtaining the word-word co-occurrence matrix.

6 answers
  •  我在风中等你
    2020-12-01 03:44

    You can use the ngram_range parameter of CountVectorizer or TfidfVectorizer to count pairs of adjacent words (bigrams).

    Code example:

    bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))  # (2, 2) means only pairs of two consecutive words are extracted
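
To see what ngram_range=(2, 2) extracts in practice, here is a minimal sketch. The two sample sentences are only illustrative, and the learned bigrams are read from the fitted vocabulary_ attribute:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative sample documents (an assumption, chosen for this sketch)
samples = ['awesome unicorns are awesome', 'batman forever and ever']

# ngram_range=(2, 2): the features are pairs of consecutive tokens only
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
counts = bigram_vectorizer.fit_transform(samples)  # documents x bigrams (sparse)

# vocabulary_ maps each learned bigram to its column index
bigrams = sorted(bigram_vectorizer.vocabulary_)
print(bigrams)
print(counts.toarray())
```

Each row of the resulting matrix is a document and each column is one adjacent word pair, so pairs that are not next to each other are never counted.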
    

    If you want to state explicitly which word pairs to count, use the vocabulary parameter, e.g. vocabulary = {'awesome unicorns': 0, 'batman forever': 1}

    http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

    Here is self-explanatory, ready-to-use code with predefined word pairs. In this case we count the co-occurrences of "awesome unicorns" and "batman forever":

    from sklearn.feature_extraction.text import CountVectorizer
    import numpy as np
    samples = ['awesome unicorns are awesome', 'batman forever and ever', 'I love batman forever']
    # A fixed vocabulary restricts counting to these two bigrams (columns 0 and 1)
    bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary={'awesome unicorns': 0, 'batman forever': 1})
    co_occurrences = bigram_vectorizer.fit_transform(samples)
    print('Printing sparse matrix:', co_occurrences)
    print('Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.toarray())
    # Sum each column over all documents to get the total count per bigram
    sum_occ = np.asarray(co_occurrences.sum(axis=0)).ravel()
    print('Sum of word-word occurrences:', sum_occ)
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names_out(), sum_occ.tolist())))
    

    The final output is ('awesome unicorns', 1), ('batman forever', 2), which matches the sample data exactly.
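
Bigram counts only cover adjacent word pairs. If a full word-word matrix is wanted, one common alternative (not part of the answer above) is to multiply the document-term matrix by its transpose. This is a minimal sketch, assuming "co-occurrence" means "both words appear in the same document"; the variable names are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

samples = ['awesome unicorns are awesome', 'batman forever and ever', 'I love batman forever']

# binary=True records presence/absence, so each document adds at most 1 per word pair
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(samples)  # documents x words (sparse)

# (X^T X)[i, j] = number of documents that contain both word i and word j
cooc = (X.T @ X).toarray()
np.fill_diagonal(cooc, 0)  # drop each word's co-occurrence with itself

vocab = vectorizer.vocabulary_  # word -> row/column index
print(cooc[vocab['batman'], vocab['forever']])    # documents containing both words
print(cooc[vocab['awesome'], vocab['unicorns']])
```

The result is symmetric, with one row and one column per vocabulary word. For a large vocabulary, keeping the product sparse instead of calling toarray() avoids building a dense words-by-words array.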
