bigrams instead of single words in termdocument matrix using R and Rweka
问题 I\'ve found a way to use use bigrams instead of single tokens in a term-document matrix. The solution has been posed on stackoverflow here: findAssocs for multiple terms in R The idea goes something like this: library(tm) library(RWeka) data(crude) #Tokenizer for n-grams and passed on to the term-document matrix constructor BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)) txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer)) However