Error in simple_triplet_matrix — unable to use RWeka to count Phrases

Question

Using the tm package, I'm comparing a DocumentTermMatrix against a dictionary list to count totals:

totals <- inspect(DocumentTermMatrix(x, list(dictionary = d)))

This works great for single words, but I also want to count two-word phrases and can't figure out how to do this.

I tried RWeka:

TrigramTokenizer <- function(x) NGramTokenizer(x, 
                                               Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(v.corpus, 
                          control = list(tokenize = TrigramTokenizer))

But I get the following error message:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  : 
  'i, j, v' different lengths
In addition: Warning messages:
1: In parallel::mclapply(x, termFreq, control) :
  all scheduled cores encountered errors in user code
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
  NAs introduced by coercion.

Can you help with the error message?

Thanks!!

Answer 1:

See my answer here

There seem to be problems using RWeka together with the parallel package. I found a workaround here:

1: http://r.789695.n4.nabble.com/RWeka-and-multicore-package-td4678473.html#a4678948

The most important point is to not load the RWeka package, and instead to call its functions through the namespace (RWeka::) inside an encapsulating function.

So your tokenizer should look like this:
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
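
To show how this tokenizer plugs into the call from the question, here is a minimal sketch. It assumes tm and RWeka are installed with a working Java setup; the two-sentence corpus is made up purely for illustration, and only tm is attached with library(), so RWeka is reached through its namespace as described above.

```r
# Sketch only: assumes tm and RWeka (plus Java) are installed.
# RWeka is deliberately NOT attached with library(); it is only
# reached through RWeka:: inside the tokenizer function.
library(tm)

# Hypothetical example documents, not taken from the question
docs <- c("the quick brown fox", "the quick red fox")
v.corpus <- VCorpus(VectorSource(docs))

# Tokenizer from the answer: emits two-word phrases only
BigramTokenizer <- function(x) {
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
}

# Same call shape as in the question, with the tokenizer swapped in
tdm <- TermDocumentMatrix(v.corpus,
                          control = list(tokenize = BigramTokenizer))

inspect(tdm)
```

For three-word phrases, as in the question's TrigramTokenizer, the same pattern should apply with min = 3, max = 3.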

Source: https://stackoverflow.com/questions/20577040/error-in-simple-triplet-matrix-unable-to-use-rweka-to-count-phrases
