Math of tm::findAssocs: how does this function work?

Submitted by 半腔热情 on 2019-11-27 17:14:15

Question


I have been using findAssocs() for text mining (tm package), but I realized that something doesn't seem right with my dataset.

My dataset is 1500 open-ended answers saved in one column of a CSV file. I loaded the dataset like this and used the typical tm_map steps to turn it into a corpus.

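library(tm)
Q29 <- read.csv("favoritegame2.csv")
corpus <- Corpus(VectorSource(Q29$Q29))
# Wrap base functions such as tolower in content_transformer() so the
# documents keep their corpus structure in current tm versions.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)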

findAssocs(dtm, "like", .2)
> cousin  fill  ....
  0.28    0.20      

Q1. When I look up the terms associated with "like", the output does not include like = 1 (the term's correlation with itself). However,

dtm.df <- as.data.frame(as.matrix(dtm))

this data frame consists of 1500 observations of 1689 variables. (Or is it because the data is saved in a row of the CSV file?)

Q2. Even though cousin and fill each showed up once when the target term like showed up once, their scores differ, as shown above. Shouldn't they be the same?

I'm trying to find the math behind findAssocs() but have had no success yet. Any advice is highly appreciated!


Answer 1:


 findAssocs
#function (x, term, corlimit) 
#UseMethod("findAssocs", x)
#<environment: namespace:tm>

methods(findAssocs )
#[1] findAssocs.DocumentTermMatrix* findAssocs.matrix*   findAssocs.TermDocumentMatrix*

 getAnywhere(findAssocs.DocumentTermMatrix)
#-------------
A single object matching ‘findAssocs.DocumentTermMatrix’ was found
It was found in the following places
  registered S3 method for findAssocs from namespace tm
  namespace:tm
with value

function (x, term, corlimit) 
{
    ind <- term == Terms(x)
    suppressWarnings(x.cor <- cor(as.matrix(x[, ind]), as.matrix(x[, 
        !ind])))

This is where self-references are removed: x[, !ind] excludes the term's own column, so the term is never correlated with itself.

    findAssocs(x.cor, term, corlimit)
}
<environment: namespace:tm>
#-------------
 getAnywhere(findAssocs.matrix)
#-------------
A single object matching ‘findAssocs.matrix’ was found
It was found in the following places
  registered S3 method for findAssocs from namespace tm
  namespace:tm
with value

function (x, term, corlimit) 
sort(round(x[term, which(x[term, ] > corlimit)], 2), decreasing = TRUE)
<environment: namespace:tm>
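
The two methods above can be reproduced in base R without tm. This is a minimal sketch rather than tm's actual code path: it assumes a small dense matrix `m` holding the toy word1 to word5 counts used in another answer on this page, and `find_assocs_sketch` is a hypothetical helper name.

```r
# Toy document-term matrix: 6 documents (rows) x 5 terms (columns).
m <- matrix(
  c(0, 0, 0, 0, 0,
    1, 0, 0, 0, 0,
    1, 1, 0, 0, 0,
    1, 1, 1, 0, 0,
    1, 1, 1, 1, 0,
    1, 1, 1, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(Docs = as.character(1:6), Terms = paste0("word", 1:5))
)

# Hypothetical helper mirroring the two tm methods on a dense matrix.
find_assocs_sketch <- function(m, term, corlimit) {
  ind <- colnames(m) == term
  # Correlate the target column with every other column. The target itself
  # is excluded by !ind, so no self-correlation of 1 is ever reported.
  x.cor <- cor(m[, ind, drop = FALSE], m[, !ind, drop = FALSE])
  # Keep correlations above the limit, round to 2 decimals, sort descending.
  keep <- which(x.cor[term, ] > corlimit)
  sort(round(x.cor[term, keep], 2), decreasing = TRUE)
}

res <- find_assocs_sketch(m, "word1", 0)
print(res)
```

On this toy matrix the sketch should give the same word2 to word5 values that findAssocs(dtm, "word1", 0) reports for the equivalent DocumentTermMatrix, and "word1" itself does not appear in the result.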



Answer 2:


I don't think anyone has answered your final question.

"I'm trying to find the math of findAssocs() but no success yet. Any advice is highly appreciated!"

The math of findAssocs() is based on the standard function cor() in the stats package of R. Given two numeric vectors, cor() computes (by default) the Pearson correlation: their covariance divided by the product of their standard deviations.

So given a DocumentTermMatrix dtm containing terms "word1" and "word2" such that findAssocs(dtm, "word1", 0) returns "word2" with a value of x, the correlation of the term vectors for "word1" and "word2" is x (rounded to two decimal places).

Here is a long-winded example:

> data <-  c("", "word1", "word1 word2","word1 word2 word3","word1 word2 word3 word4","word1 word2 word3 word4 word5") 
> dtm <- DocumentTermMatrix(VCorpus(VectorSource(data)))
> as.matrix(dtm)
    Terms
Docs word1 word2 word3 word4 word5
   1     0     0     0     0     0
   2     1     0     0     0     0
   3     1     1     0     0     0
   4     1     1     1     0     0
   5     1     1     1     1     0
   6     1     1     1     1     1
> findAssocs(dtm, "word1", 0) 
$word1
word2 word3 word4 word5 
 0.63  0.45  0.32  0.20 

> cor(as.matrix(dtm)[,"word1"], as.matrix(dtm)[,"word2"])
[1] 0.6324555
> cor(as.matrix(dtm)[,"word1"], as.matrix(dtm)[,"word3"])
[1] 0.4472136

and so on for words 4 and 5.
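
The first two values can also be checked by hand. The following is a worked sketch using the computational form of the Pearson correlation, where x is the word1 column, y is the other term's column, and n = 6 documents; the symbols are introduced here for illustration and do not appear in the tm source.

```latex
r_{xy} = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}
              {\sqrt{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}\,
               \sqrt{n\sum_i y_i^2 - \left(\sum_i y_i\right)^2}}

% word1 vs word2: \sum x = 5,\ \sum y = 4,\ \sum xy = 4,\ \sum x^2 = 5,\ \sum y^2 = 4
r = \frac{6 \cdot 4 - 5 \cdot 4}{\sqrt{6 \cdot 5 - 25}\,\sqrt{6 \cdot 4 - 16}}
  = \frac{4}{\sqrt{5}\,\sqrt{8}} = \frac{4}{\sqrt{40}} \approx 0.6325

% word1 vs word3: \sum y = 3,\ \sum xy = 3,\ \sum y^2 = 3
r = \frac{6 \cdot 3 - 5 \cdot 3}{\sqrt{5}\,\sqrt{6 \cdot 3 - 9}}
  = \frac{3}{\sqrt{45}} \approx 0.4472
```

Both results agree with the cor() output above and, after rounding to two decimals, with the 0.63 and 0.45 reported by findAssocs().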

See also http://r.789695.n4.nabble.com/findAssocs-tt3845751.html#a4637248




Answer 3:


Incidentally, if your term-document matrix is very large, you may want to try this version of findAssocs:

# u is a term document matrix (transpose of a DTM)
# term is your term
# corlimit is a value -1 to 1

findAssocsBig <- function(u, term, corlimit){
  suppressWarnings(x.cor <-  gamlr::corr(t(u[ !u$dimnames$Terms == term, ]),        
                                         as.matrix(t(u[  u$dimnames$Terms == term, ]))  ))  
  x <- sort(round(x.cor[(x.cor[, term] > corlimit), ], 2), decreasing = TRUE)
  return(x)
}

The advantage of this is that it uses a different method of converting the TDM to a matrix than tm::findAssocs does. This method uses memory more efficiently, which means you can use larger TDMs (or DTMs) than tm::findAssocs can handle. Of course, with a big enough TDM/DTM you will get a memory allocation error with this function as well.




Answer 4:


Your dtm has 1689 variables because that is the number of unique words in your observations (excluding stop words and numbers). Probably the word "like" shows up in more than one of your 1500 observations and isn't always accompanied by "cousin" and "fill". Did you count how many times "like" shows up?
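
To illustrate one way the scores can differ, here is a small sketch in base R. The counts are made up for illustration and are not taken from the asker's data: both "cousin" and "fill" appear in exactly one document together with "like", yet their correlations with "like" are not the same, because "like" occurs in several documents and with different counts.

```r
# Hypothetical term counts for 6 documents (invented for illustration only).
like   <- c(2, 1, 0, 0, 1, 0)
cousin <- c(1, 0, 0, 0, 0, 0)  # appears once, where "like" occurs twice
fill   <- c(0, 1, 0, 0, 0, 0)  # appears once, where "like" occurs once

# Number of documents that contain "like" at least once.
n_like_docs <- sum(like > 0)

# Pearson correlations, the same quantity findAssocs() reports.
r_cousin <- cor(like, cousin)
r_fill <- cor(like, fill)

print(c(docs_with_like = n_like_docs, cousin = r_cousin, fill = r_fill))
```

On a real DocumentTermMatrix, the analogous check would be to count the non-zero entries in the "like" column, for example with sum(as.matrix(dtm)[, "like"] > 0).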



Source: https://stackoverflow.com/questions/14267199/math-of-tmfindassocs-how-does-this-function-work
