How to find term frequency within a DTM in R?

我的未来我决定 submitted on 2019-12-07 12:44:28

Question


I've been using the tm package to create a DocumentTermMatrix as follows:

library(tm)
library(RWeka)
library(SnowballC)
src <- DataframeSource(data.frame(data3$JobTitle))

# create a corpus and transform data
# Sets the default number of threads to use
options(mc.cores=1)
c_copy <- c <- Corpus(src)
c <- tm_map(c, content_transformer(tolower), mc.cores = 1)
c <- tm_map(c, content_transformer(removeNumbers), mc.cores = 1)
c <- tm_map(c, removeWords, stopwords("english"), mc.cores = 1)
c <- tm_map(c, content_transformer(stripWhitespace), mc.cores = 1)

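# bigram tokenizer (not shown in the original post; assumed to be the usual RWeka definition)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# make DTM
dtm <- DocumentTermMatrix(c, control = list(tokenize = BigramTokenizer))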

The DTM comes out fine. What I want is the frequencies of the frequent terms within it. findFreqTerms gives me the terms themselves but not their counts, and termFreq only works on a TextDocument, not on a DTM or TDM. Any ideas?

Output from str (the terms are in $dimnames$Terms):

> str(dtm)
List of 6
 $ i       : int [1:190] 1 2 3 4 5 6 7 8 9 10 ...
 $ j       : int [1:190] 1 2 3 4 5 6 7 8 9 10 ...
 $ v       : num [1:190] 1 1 1 1 1 1 1 1 1 1 ...
 $ nrow    : int 119
 $ ncol    : int 146
 $ dimnames:List of 2
  ..$ Docs : chr [1:119] "1" "2" "3" "4" ...
  ..$ Terms: chr [1:146] "account administrator" "account assistant" "account director" "account executive" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

Answer 1:


Thanks to NicE for the advice; it works well. Adding the weighting argument lets me get the term frequencies out when I inspect the DTM. It is then a simple matter of summing each column.

dtm <- DocumentTermMatrix(c, control = list(tokenize = BigramTokenizer, weighting=weightTf))
freqs <- as.data.frame(inspect(dtm))
colSums(freqs)
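
A related sketch, in case it helps: the column sums can also be taken directly on the sparse matrix with col_sums from the slam package (a dependency of tm), which avoids building a dense data frame first. The job titles below are made-up stand-ins for data3$JobTitle, and the default word tokenizer is used instead of BigramTokenizer to keep the example free of RWeka. Recent tm versions appear to print only a sample of the matrix from inspect(), so this may be the more dependable route on a large DTM.

```r
library(tm)  # tm depends on slam, which provides col_sums()

# Hypothetical job titles standing in for data3$JobTitle
titles <- c("account manager", "account director", "sales manager")
corpus <- VCorpus(VectorSource(titles))
dtm <- DocumentTermMatrix(corpus)  # default weighting is term frequency

# Sum each term's column without converting to a dense matrix
freqs <- sort(slam::col_sums(dtm), decreasing = TRUE)
print(freqs)

# Counts for only the terms findFreqTerms() reports (at least 2 occurrences here)
frequent <- findFreqTerms(dtm, lowfreq = 2)
print(freqs[frequent])
```

The result is a named numeric vector, so freqs[["account"]] looks up a single term's total.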



Answer 2:


You can use Tyler Rinker's excellent qdap package. Its freq_terms function returns the terms and their frequencies. This example takes the 20 most frequent terms that have at least 4 letters and uses one of qdap's stopword lists, Top200Words, which is more extensive than tm's built-in English stopword list (200 words versus about 175).

library(qdap)
qdap.freq <- freq_terms(dtm, top = 20, at.least = 4, stopwords = Top200Words)
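
If installing qdap is not an option, the same kind of filtering (a minimum term length, then the top N by count) can be approximated with tm alone. This is only a rough sketch under assumptions, not qdap's implementation: the titles are invented, no stopword list is applied, and top_n and min_chars are hypothetical names chosen for illustration.

```r
library(tm)  # tm depends on slam, which provides col_sums()

# Hypothetical job titles standing in for data3$JobTitle
titles <- c("it account manager", "account director", "it sales manager")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(titles)),
                          control = list(wordLengths = c(1, Inf)))

top_n <- 2      # hypothetical: how many terms to keep
min_chars <- 4  # hypothetical: minimum term length, like at.least = 4

counts <- slam::col_sums(dtm)                    # total count per term
counts <- counts[nchar(names(counts)) >= min_chars]
top_terms <- head(sort(counts, decreasing = TRUE), top_n)

# Two-column table of terms and their frequencies
freq_table <- data.frame(term = names(top_terms),
                         freq = as.vector(top_terms))
print(freq_table)
```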


Source: https://stackoverflow.com/questions/28580460/how-to-find-term-frequency-within-a-dtm-in-r
