Finding 2 & 3 word Phrases Using R TM Package

Asked by 死守一世寂寞 on 2020-11-28 04:26

I am trying to find code that actually works to find the most frequently used two- and three-word phrases with the R text mining (tm) package (maybe there is another package for it that I don't know of).

7 Answers
  •  Answered by -上瘾入骨i on 2020-11-28 05:11

    You can pass in a custom tokenizing function to tm's DocumentTermMatrix function, so if you have package tau installed it's fairly straightforward.

    library(tm)
    library(tau)
    
    # Tokenizer that returns every n-word phrase in x (default: trigrams),
    # using tau::textcnt to do the actual n-gram counting.
    tokenize_ngrams <- function(x, n = 3) {
      rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n))))
    }
    
    texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
    corpus <- Corpus(VectorSource(texts))
    matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))
    

    Where n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in package RTextTools, which further simplifies things.

    library(RTextTools)
    
    texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
    matrix <- create_matrix(texts, ngramLength = 3)
    

    This returns an object of class DocumentTermMatrix for use with package tm.
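    To actually answer the original question, you still need to rank the phrases by frequency. Here is a sketch of my own built on the answer's tm/tau pipeline (not part of the original answer): sum each column of the document-term matrix and sort. Note that newer versions of tm ignore custom tokenizers for a SimpleCorpus, so I use VCorpus here as a precaution.

    ```r
    library(tm)
    library(tau)

    # Same tokenizer as in the answer above: n-word phrases via tau::textcnt.
    tokenize_ngrams <- function(x, n = 3) {
      rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n))))
    }

    texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
    # VCorpus rather than Corpus: recent tm builds a SimpleCorpus from Corpus()
    # and silently drops custom tokenize functions for it.
    corpus <- VCorpus(VectorSource(texts))
    dtm <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))

    # Total frequency of each phrase across all documents, most frequent first.
    phrase_freqs <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
    head(phrase_freqs)  # e.g. "this is the" occurs in all three documents
    ```

    The same colSums/sort idiom works on the matrix returned by RTextTools::create_matrix, since it is also a DocumentTermMatrix.
    
    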
