Finding 2 & 3 word Phrases Using R TM Package

死守一世寂寞  2020-11-28 04:26

I am trying to find code that actually works for finding the most frequently used two- and three-word phrases with the R text mining (tm) package (maybe there is another package for this that I don't know of).

7 Answers
  •  陌清茗 (OP)  2020-11-28 04:57

    This is entry 5 of the tm package's FAQ:

    5. Can I use bigrams instead of single tokens in a term-document matrix?

    Yes. RWeka provides a tokenizer for arbitrary n-grams which can be directly passed on to the term-document matrix constructor. E.g.:

      library("RWeka")
      library("tm")
    
      data("crude")
    
      BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
      tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
    
      inspect(tdm[340:345,1:10])
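
    Since the question asks for both two- and three-word phrases, the same approach can be widened to min = 2, max = 3 so the tokenizer emits bigrams and trigrams together. The sketch below is a minimal illustration of that idea; PhraseTokenizer is just a made-up name, and the lowfreq cutoff and the head() count are arbitrary values chosen for the example:

      library("RWeka")
      library("tm")

      data("crude")

      # tokenizer that emits both two- and three-word sequences
      PhraseTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
      tdm <- TermDocumentMatrix(crude, control = list(tokenize = PhraseTokenizer))

      # phrases appearing at least 5 times anywhere in the corpus
      findFreqTerms(tdm, lowfreq = 5)

      # or rank every phrase by total frequency and show the top 20
      freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
      head(freq, 20)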
    
