How do I calculate a word-word co-occurrence matrix with sklearn?

南旧 2020-12-01 03:06

I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.

I can get the document-term matrix, but I am not sure how to go about obtaining a word-word co-occurrence matrix.

6 answers
  •  抹茶落季
    2020-12-01 03:40

    None of the other answers takes a moving window into account, so I wrote my own function that builds the co-occurrence matrix by sliding a window of a given size over each sentence. It takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix:

    from collections import defaultdict

    import numpy as np
    import pandas as pd


    def co_occurrence(sentences, window_size):
        d = defaultdict(int)
        vocab = set()
        for text in sentences:
            # naive preprocessing: lowercase and split on whitespace (use a real tokenizer instead)
            text = text.lower().split()
            # pair each token with the window_size tokens that follow it
            for i in range(len(text)):
                token = text[i]
                vocab.add(token)  # add to vocab
                next_token = text[i+1 : i+1+window_size]
                for t in next_token:
                    key = tuple( sorted([t, token]) )
                    d[key] += 1
    
        # formulate the dictionary into dataframe
        vocab = sorted(vocab) # sort vocab
        df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int64),
                          index=vocab,
                          columns=vocab)
        for key, value in d.items():
            df.at[key[0], key[1]] = value
            df.at[key[1], key[0]] = value
        return df
    

    Let's try it out given the following two simple sentences:

    >>> text = ["I go to school every day by bus .",
    ...         "i go to theatre every night by bus"]
    >>> df = co_occurrence(text, 2)
    >>> df
             .  bus  by  day  every  go  i  night  school  theatre  to
    .        0    1   1    0      0   0  0      0       0        0   0
    bus      1    0   2    1      0   0  0      1       0        0   0
    by       1    2   0    1      2   0  0      1       0        0   0
    day      0    1   1    0      1   0  0      0       1        0   0
    every    0    0   2    1      0   0  0      1       1        1   2
    go       0    0   0    0      0   0  2      0       1        1   2
    i        0    0   0    0      0   2  0      0       0        0   2
    night    0    1   1    0      1   0  0      0       0        1   0
    school   0    0   0    1      1   1  0      0       0        0   1
    theatre  0    0   0    0      1   1  0      1       0        0   1
    to       0    0   0    0      2   2  2      0       1        1   0
    
    [11 rows x 11 columns]
    

    Now, we have our co-occurrence matrix.
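
    To see where the entries in the table come from, the pairs produced by the window can be counted directly with the standard library. This is only a minimal sketch for checking the numbers above: the helper name window_pairs is hypothetical (it is not part of the function above or of any library), and it assumes the same lowercase-and-split preprocessing.

```python
from collections import Counter


def window_pairs(tokens, window_size):
    """Yield each token paired with the tokens up to window_size positions after it.

    Pairs are returned as sorted tuples, so (a, b) and (b, a) count as the same pair.
    (Hypothetical helper, used only to illustrate the counting.)
    """
    for i, token in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window_size]:
            yield tuple(sorted((token, other)))


sentences = ["I go to school every day by bus .",
             "i go to theatre every night by bus"]

counts = Counter()
for sentence in sentences:
    counts.update(window_pairs(sentence.lower().split(), 2))

print(counts[("bus", "by")])    # adjacent in both sentences
print(counts[("every", "to")])  # two positions apart in both sentences
print(counts[("i", "school")])  # three positions apart, outside the window
```

    With window_size=2 the first two lookups give 2 and the last gives 0, matching the corresponding cells of the DataFrame. Note that the window is applied within each sentence, so words never co-occur across sentence boundaries.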
