I am trying to find code that actually works to find the most frequently used two- and three-word phrases with the R text mining (tm) package (maybe there is another package for it that I don't know of).
I had a similar problem when using the tm and ngram packages.
After debugging mclapply, I saw that documents with fewer than two words failed with the following error:
input 'x' has nwords=1 and n=2; must have nwords >= n
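For reference, the error is easy to reproduce: any single-word document makes ngram::ngram fail when n = 2 (a minimal sketch):

ngram::ngram("hello", n = 2)
# Error: input 'x' has nwords=1 and n=2; must have nwords >= n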
So I added a filter to remove documents with too few words:
# Keep only documents that contain at least two words
myCorpus.3 <- tm_filter(myCorpus.2, function(x) {
  length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1
})
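As an aside, the ngram package's own wordcount() helper could express the same check more directly (a sketch, assuming the documents hold plain space-separated text):

myCorpus.3 <- tm_filter(myCorpus.2, function(x) {
  # wordcount() sums the words across the content vector
  ngram::wordcount(stringr::str_trim(x$content)) > 1
})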
My tokenizer function then looks like this:
bigramTokenizer <- function(x) {
  x <- as.character(x)

  # Find unigrams
  one.list <- c()
  tryCatch({
    one.gram <- ngram::ngram(x, n = 1)
    one.list <- ngram::get.ngrams(one.gram)
  },
  error = function(cond) { warning(cond) })

  # Find bigrams
  two.list <- c()
  tryCatch({
    two.gram <- ngram::ngram(x, n = 2)
    two.list <- ngram::get.ngrams(two.gram)
  },
  error = function(cond) { warning(cond) })

  # Combine and drop empty tokens
  res <- unlist(c(one.list, two.list))
  res[res != '']
}
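A quick sanity check on a toy string (the exact order of the n-grams returned by get.ngrams is not guaranteed):

bigramTokenizer("the quick brown fox")
# e.g. "the" "quick" "brown" "fox" "the quick" "quick brown" "brown fox"

To also cover the three-word phrases from the original question, you could add a third tryCatch block with n = 3 in the same way.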
Then you can test the function with:
dtmTest <- lapply(myCorpus.3, bigramTokenizer)
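and inspect the tokens of, say, the first document:

head(dtmTest[[1]], 10)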
And finally, build the document-term matrix:
dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer))
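Since the goal was the most frequent phrases, you can then rank the terms. A sketch using slam::col_sums (tm stores the matrix in slam's sparse format) and tm's findFreqTerms:

freq <- sort(slam::col_sums(dtm), decreasing = TRUE)
head(freq, 20)
# or keep every phrase that occurs at least 10 times:
findFreqTerms(dtm, lowfreq = 10)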