How do I calculate a word-word co-occurrence matrix with sklearn?

南旧 2020-12-01 03:06

I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.

I can get the document-term matrix but not sure how to go about obtain

6 answers
  •  自闭症患者
    2020-12-01 03:37

    @titipata I think your solution is not a good metric, because it gives the same weight to real co-occurrences and to ones that are just spurious. For example, suppose I have 5 texts and the words apple and house appear with these frequencies:

    text1: apple: 10, house: 1

    text2: apple: 10, house: 0

    text3: apple: 10, house: 0

    text4: apple: 10, house: 0

    text5: apple: 10, house: 0

    The co-occurrence we are going to measure is 10*1+10*0+10*0+10*0+10*0=10, but it is just spurious: the two words appear together in only one text.

    And in other, more important cases, like the following:

    text1: apple: 1, banana: 1

    text2: apple: 1, banana: 1

    text3: apple: 1, banana: 1

    text4: apple: 1, banana: 1

    text5: apple: 1, banana: 1

    we are going to get a co-occurrence of just 1*1+1*1+1*1+1*1+1*1=5, when in fact that co-occurrence is really important: the two words appear together in every text.
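The arithmetic in the two examples can be checked with a short sketch. The dictionaries and the helper name `raw_cooccurrence` are made up for illustration; the helper computes a single entry of the count-based product, not the whole matrix.

```python
# Per-text word counts copied from the two examples above.
apple_vs_house = {"apple": [10, 10, 10, 10, 10], "house": [1, 0, 0, 0, 0]}
apple_vs_banana = {"apple": [1, 1, 1, 1, 1], "banana": [1, 1, 1, 1, 1]}


def raw_cooccurrence(counts_a, counts_b):
    """Sum of per-text count products (hypothetical helper, one matrix entry)."""
    return sum(a * b for a, b in zip(counts_a, counts_b))


house_score = raw_cooccurrence(apple_vs_house["apple"], apple_vs_house["house"])
banana_score = raw_cooccurrence(apple_vs_banana["apple"], apple_vs_banana["banana"])
print(house_score, banana_score)  # 10 5
```

The pair that shares a single text scores higher than the pair that shares all five, which is the problem described above.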

    @Guiem Bosch In this case co-occurrences are measured only when the two words are contiguous.

    I propose using something like @titipata's solution to compute the matrix:

    Xc = (Y.T * Y) # this is co-occurrence matrix in sparse csr format
    

    where, instead of using the count matrix X, you use a matrix Y that has ones wherever X is greater than 0 and zeros everywhere else.

    Using this, in the first example we are going to have co-occurrence: 1*1+1*0+1*0+1*0+1*0=1, and in the second example co-occurrence: 1*1+1*1+1*1+1*1+1*1=5, which is what we are really looking for.
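A minimal sketch of this binarized version, assuming the document-term matrix is available as a SciPy sparse matrix (rows are texts, columns are words). The function name `binary_cooccurrence` and the two small matrices are hypothetical and only reproduce the examples above.

```python
import numpy as np
from scipy.sparse import csr_matrix


def binary_cooccurrence(X):
    """Return Y.T @ Y, where Y is 1 wherever X > 0 and 0 elsewhere."""
    Y = (csr_matrix(X) > 0).astype(np.int64)  # presence/absence per text
    return Y.T @ Y  # entry (i, j): number of texts containing both words


# First example: columns are apple, house.
X1 = np.array([[10, 1], [10, 0], [10, 0], [10, 0], [10, 0]])
# Second example: columns are apple, banana.
X2 = np.array([[1, 1], [1, 1], [1, 1], [1, 1], [1, 1]])

c1 = binary_cooccurrence(X1).toarray()
c2 = binary_cooccurrence(X2).toarray()
print(c1[0, 1], c2[0, 1])  # 1 5
```

The diagonal of the result holds the number of texts each word appears in. If the matrix comes from sklearn's `CountVectorizer`, passing `binary=True` should give Y directly, so the thresholding step can be skipped.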
