Question
I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.
I can get the document-term matrix but am not sure how to go about obtaining a word-word matrix of co-occurrences.
Answer 1:
Here is my example solution using CountVectorizer in scikit-learn. As noted in this post, you can simply use matrix multiplication to get the word-word co-occurrence matrix.
from sklearn.feature_extraction.text import CountVectorizer
docs = ['this this this book',
        'this cat good',
        'cat good shit']
count_model = CountVectorizer(ngram_range=(1,1)) # default unigram model
X = count_model.fit_transform(docs)
# X[X > 0] = 1 # run this line if you don't want extra within-text co-occurrence (see below)
Xc = (X.T * X) # this is the co-occurrence matrix in sparse csr format
Xc.setdiag(0) # sometimes you want to zero out same-word co-occurrence
print(Xc.todense()) # print out matrix in dense format
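For the three example docs, this should print the following (rows and columns follow the alphabetical vocabulary order book, cat, good, shit, this):
[[0 0 0 0 3]
 [0 0 2 1 1]
 [0 2 0 1 1]
 [0 1 1 0 0]
 [3 1 1 0 0]]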
You can also refer to the dictionary of words in count_model:
count_model.vocabulary_
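For the docs above, this maps each word to its column index, e.g. {'book': 0, 'cat': 1, 'good': 2, 'shit': 3, 'this': 4} (the indices follow the alphabetical order of the terms; the dict's print order may differ).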
Or, if you want to normalize by the diagonal component (as suggested in the answer to the previous post):
import scipy.sparse as sp
Xc = (X.T * X)
g = sp.diags(1./Xc.diagonal())
Xc_norm = g * Xc # normalized co-occurrence matrix
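Here sp.diags builds a sparse diagonal matrix from the reciprocals of Xc's diagonal, so left-multiplying by g scales row i of Xc by 1 / Xc[i, i].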
Additionally, to address @Federico Caccia's answer: if you don't want co-occurrences that are spurious within a single text, set counts greater than 1 to 1, e.g.
X[X > 0] = 1 # do this first, before computing the co-occurrence
Xc = (X.T * X)
...
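With the example docs above, this binarized version should yield (after Xc.setdiag(0)):
[[0 0 0 0 1]
 [0 0 2 1 1]
 [0 2 0 1 1]
 [0 1 1 0 0]
 [1 1 1 0 0]]
Note how the this/book entry drops from 3 to 1: the triple 'this' in the first document no longer inflates the count.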
Answer 2:
@titipata I think your solution is not a good metric because we are giving the same weight to real co-occurrences and to occurrences that are just spurious. For example, if I have 5 texts and the words apple and house appear with this frequency:
text1: "apple": 10, "house": 1
text2: "apple": 10, "house": 0
text3: "apple": 10, "house": 0
text4: "apple": 10, "house": 0
text5: "apple": 10, "house": 0
The co-occurrence we are going to measure is 10*1+10*0+10*0+10*0+10*0=10, but it is just spurious.
And there are other important cases, like the following:
text1: "apple": 1, "banana": 1
text2: "apple": 1, "banana": 1
text3: "apple": 1, "banana": 1
text4: "apple": 1, "banana": 1
text5: "apple": 1, "banana": 1
where we are going to get just a co-occurrence of 1*1+1*1+1*1+1*1+1*1=5, when in fact that co-occurrence is really important.
@Guiem Bosch: in that case co-occurrences are measured only when the two words are contiguous.
I propose to use something like the @titipata solution to compute the matrix:
Xc = (Y.T * Y) # this is co-occurrence matrix in sparse csr format
where, instead of using X, you use a matrix Y with ones in the positions where X is greater than 0 and zeros elsewhere.
Using this, in the first example we are going to have a co-occurrence of 1*1+1*0+1*0+1*0+1*0=1, and in the second example 1*1+1*1+1*1+1*1+1*1=5, which is what we are really looking for.
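A minimal sketch of building Y, reusing the sparse document-term matrix X from answer 1:
Y = (X > 0).astype(int)  # ones where a word occurs in a text, zeros elsewhere
Yc = (Y.T * Y)           # binary co-occurrence matrix in sparse format
Yc.setdiag(0)            # optional: zero out same-word co-occurrence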
Answer 3:
You can use the ngram_range parameter in CountVectorizer or TfidfVectorizer.
Code example:
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2)) # by saying (2, 2) you are telling it you only want pairs of 2 words
In case you want to explicitly say which co-occurrences of words you want to count, use the vocabulary parameter, e.g. vocabulary = {'awesome unicorns': 0, 'batman forever': 1}
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
Self-explanatory and ready-to-use code with predefined word-word co-occurrences. In this case we are tracking co-occurrences of awesome unicorns and batman forever:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
samples = ['awesome unicorns are awesome', 'batman forever and ever', 'I love batman forever']
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary={'awesome unicorns': 0, 'batman forever': 1})
co_occurrences = bigram_vectorizer.fit_transform(samples)
print('Printing sparse matrix:', co_occurrences)
print('Printing dense matrix (cols are vocabulary keys 0 -> "awesome unicorns", 1 -> "batman forever"):', co_occurrences.todense())
sum_occ = np.sum(co_occurrences.todense(), axis=0)
print('Sum of word-word occurrences:', sum_occ)
print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names(), np.array(sum_occ)[0].tolist())))
The final output is [('awesome unicorns', 1), ('batman forever', 2)], which corresponds exactly to the provided samples.
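Note: on recent scikit-learn versions (1.2+), get_feature_names() has been removed in favour of get_feature_names_out(), so swap the call accordingly.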
Answer 4:
None of the provided answers took the moving-window concept into consideration, so I wrote my own function that builds the co-occurrence matrix by applying a moving window of a defined size. This function takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix:
from collections import defaultdict

import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use a real tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple(sorted([t, token]))
                d[key] += 1

    # formulate the dictionary into a dataframe
    vocab = sorted(vocab)  # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df
Let's try it out given the following two simple sentences:
>>> text = ["I go to school every day by bus .",
...         "i go to theatre every night by bus"]
>>>
>>> df = co_occurrence(text, 2)
>>> df
. bus by day every go i night school theatre to
. 0 1 1 0 0 0 0 0 0 0 0
bus 1 0 2 1 0 0 0 1 0 0 0
by 1 2 0 1 2 0 0 1 0 0 0
day 0 1 1 0 1 0 0 0 1 0 0
every 0 0 2 1 0 0 0 1 1 1 2
go 0 0 0 0 0 0 2 0 1 1 2
i 0 0 0 0 0 2 0 0 0 0 2
night 0 1 1 0 1 0 0 0 0 1 0
school 0 0 0 1 1 1 0 0 0 0 1
theatre 0 0 0 0 1 1 0 1 0 0 1
to 0 0 0 0 2 2 2 0 1 1 0
[11 rows x 11 columns]
Now, we have our co-occurrence matrix.
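One caveat: because the function tokenizes with a plain lower().split(), punctuation such as the trailing "." ends up as its own vocabulary entry, as seen in the first column above; plug in a proper tokenizer for real text.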
Source: https://stackoverflow.com/questions/35562789/how-do-i-calculate-a-word-word-co-occurrence-matrix-with-sklearn