Finding 2 & 3 word Phrases Using R TM Package

死守一世寂寞 2020-11-28 04:26

I am trying to find code that actually works to find the most frequently used two- and three-word phrases with the R text mining (tm) package (maybe there is another package for it that I don't know of).

7 Answers
  •  甜味超标
    2020-11-28 05:18

    I had a similar problem using the tm and ngram packages. After debugging mclapply, I saw there were problems with documents of fewer than 2 words, which failed with the following error:

       input 'x' has nwords=1 and n=2; must have nwords >= n
    
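    For context, this error is easy to reproduce: the ngram package refuses to build 2-grams from a one-word string. A minimal sketch, assuming the ngram package is installed:

        library(ngram)
        # A one-word input cannot yield any 2-grams, so ngram() throws
        # the "must have nwords >= n" error quoted above
        ngram("hello", n = 2)
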

    So I've added a filter to remove documents with too few words:

        myCorpus.3 <- tm_filter(myCorpus.2, function (x) {
          # Keep only documents that contain at least two words
          length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1
        })
    
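    A quick sanity check (not part of the original answer) is to compare the corpus sizes before and after filtering:

        length(myCorpus.2)  # number of documents before filtering
        length(myCorpus.3)  # number of documents with at least two words
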

    Then my tokenizer function looks like:

    bigramTokenizer <- function(x) {
      x <- as.character(x)
    
      # Extract unigrams (single words)
      one.list <- c()
      tryCatch({
        one.gram <- ngram::ngram(x, n = 1)
        one.list <- ngram::get.ngrams(one.gram)
      }, 
      error = function(cond) { warning(cond) })
    
      # Extract bigrams (two-word phrases)
      two.list <- c()
      tryCatch({
        two.gram <- ngram::ngram(x, n = 2)
        two.list <- ngram::get.ngrams(two.gram)
      },
      error = function(cond) { warning(cond) })
    
      # Return unigrams and bigrams together, dropping empty strings
      res <- unlist(c(one.list, two.list))
      res[res != '']
    }
    
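    Since the question also asks for three-word phrases, the same pattern extends naturally to trigrams. The following generalization is my own sketch, not part of the original answer; documents with fewer than n words simply contribute no n-grams, because tryCatch downgrades the error to a warning:

        multigramTokenizer <- function(x) {
          x <- as.character(x)
          res <- c()
          # Collect 1-, 2- and 3-grams; ngram() errors on too-short
          # input, which tryCatch turns into a warning so the loop
          # continues with the next n
          for (n in 1:3) {
            tryCatch({
              res <- c(res, ngram::get.ngrams(ngram::ngram(x, n = n)))
            },
            error = function(cond) { warning(cond) })
          }
          res[res != '']
        }
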

    Then you can test the function with:

    dtmTest <- lapply(myCorpus.3, bigramTokenizer)
    
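    Each element of dtmTest is then the character vector of 1- and 2-grams extracted from the corresponding document, e.g. dtmTest[[1]] for the first one.
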

    And finally:

    dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer))
    
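    To get the most frequent phrases the question asks for, you can then rank the terms of the resulting matrix; a minimal sketch, assuming the dense matrix fits in memory:

        # Sum each term's count over all documents and sort descending
        freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
        head(freq, 20)  # top 20 most frequent phrases
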
