Cosine similarity of 2 DTMs in R

Posted by 限于喜欢 on 2019-12-01 07:26:58

Question


I have two document-term matrices (DTMs):

  1. DTM1 has, say, 1000 vectors (1000 docs), and
  2. DTM2 has 20 vectors (20 docs).

I want to compare each document in DTM1 against every document in DTM2 and see which DTM1 docs are closest to which DTM2 docs, using cosine similarity. Any pointers would help!

I have created a cosine matrix using the "slam" package.

Docs   –glyma –ie   –initi –stafford ‘bureaucratic’ ‘empti ‘holi ‘incontrovert
  1  0.000000   0 0.000000  0.000000       0.000000      0     0             0
  2  0.000000   0 0.000000  0.000000       0.000000      0     0             0
  3  0.000000   0 0.000000  0.000000       0.000000      0     0             0
  4  0.000000   0 0.000000  0.000000       0.000000      0     0             0
  5  0.000000   0 0.000000  0.000000       0.000000      0     0             0
  6  0.000000   0 0.000000  0.000000       4.906891      0     0             0
  7  0.000000   0 0.000000  4.906891       0.000000      0     0             0
  8  0.000000   0 0.000000  0.000000       0.000000      0     0             0
  9  0.000000   0 4.906891  0.000000       0.000000      0     0             0
  10 4.906891   0 0.000000  0.000000       0.000000      0     0             0

The cosine function results are:

However, this matrix compares the docs of DTM1 with one another. I want these vectors to be compared with the vectors of DTM2 and then find the closest DTM2 document for a given DTM1 document.


Answer 1:


Here is a way to calculate the cosine similarity between the rows of two matrices. The tm package is used only to supply example data.

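library(slam)
library(tm)
data("acq")    # 50 Reuters documents
data("crude")  # 20 Reuters documents

# One DTM over all 70 documents, so both subsets share the same term columns
dtm <- DocumentTermMatrix(c(acq, crude))

# Randomly pick 10 documents for dtm1; the remaining 60 form dtm2
index <- sample(seq_len(nrow(dtm)), size = 10)

dtm1 <- dtm[index, ]
dtm2 <- dtm[-index, ]

# Numerator: dot products of every dtm1 row with every dtm2 row (10 x 60)
# Denominator: product of the two rows' Euclidean norms
cosine_sim <- tcrossprod_simple_triplet_matrix(dtm1, dtm2) /
  sqrt(row_sums(dtm1^2) %*% t(row_sums(dtm2^2)))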

The cosine function was adapted from this SO post: R: Calculate cosine distance from a term-document matrix with tm and proxy
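
To go from the similarity matrix to the closest document, one option is to take the row-wise maximum. The sketch below uses small dense base-R matrices with made-up counts (m1, m2 and their document names are hypothetical, not from the question) so that it runs without tm or slam. The same max.col step should apply to cosine_sim above, whose rows are dtm1 documents and whose columns are dtm2 documents.

```r
# Toy term-count matrices (hypothetical data): rows are documents, columns are terms.
terms <- c("t1", "t2", "t3")
m1 <- matrix(c(1, 0, 2,
               0, 3, 0,
               1, 1, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("a", 1:3), terms))
m2 <- matrix(c(2, 0, 4,
               0, 1, 0),
             nrow = 2, byrow = TRUE,
             dimnames = list(paste0("b", 1:2), terms))

# Cosine similarity: dot products divided by the outer product of the row norms.
sim <- tcrossprod(m1, m2) / outer(sqrt(rowSums(m1^2)), sqrt(rowSums(m2^2)))

# For each m1 document (row), take the m2 document (column) with the highest similarity.
best <- max.col(sim, ties.method = "first")
closest <- data.frame(doc1 = rownames(sim),
                      doc2 = colnames(sim)[best],
                      similarity = sim[cbind(seq_len(nrow(sim)), best)],
                      stringsAsFactors = FALSE)
print(closest)
```

A document with no terms has a zero norm, so its similarities come out as NaN (0/0); such rows would need to be dropped or handled before taking the maximum.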



Source: https://stackoverflow.com/questions/41721431/cosine-similarity-of-2-dtms-in-r
