Remove empty documents from DocumentTermMatrix in R topicmodels?

鱼传尺愫 2020-11-30 22:44

I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:

6 Answers
  •  感动是毒
    2020-11-30 23:43

    Just a small addendum to Dario Lacan's answer:

    empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
    

    collects the documents' ids rather than their positional row numbers. Try this:

    library(tm)
    data("crude")
    dtm <- DocumentTermMatrix(crude)
    dtm[1, ]$dimnames[1][[1]] # returns "127", not "1"
    

    If you build your own corpus with consecutive numbering, data cleaning can remove some documents and break that numbering. So it is better to filter by id directly:

    corpus <- tm_filter(
      corpus,
      # keep only documents whose id is not among the empty rows
      FUN = function(doc) !is.element(meta(doc)$id, empty.rows)
      # equivalently: !(meta(doc)$id %in% empty.rows)
    )
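
    For completeness, here is a minimal end-to-end sketch of the same idea on the `crude` data set. It makes a few assumptions that are not part of the answer above: row totals are computed with `slam::row_sums` (the referenced answer may compute `rowTotals` differently), the `dictionary = "opec"` control is only an artificial way to produce some empty rows, and the names `dtm.clean` and `corpus.clean` are made up for illustration.

    ```r
    library(tm)

    data("crude")

    # Illustration only: restricting the vocabulary to a single term
    # leaves every document that lacks this term with an all-zero row.
    dtm <- DocumentTermMatrix(crude, control = list(dictionary = "opec"))

    # Number of term occurrences per document (one entry per row of the DTM).
    rowTotals <- slam::row_sums(dtm)

    # Ids (not positions) of the documents whose rows are empty.
    empty.rows <- Docs(dtm)[rowTotals == 0]

    # Drop the empty rows from the matrix ...
    dtm.clean <- dtm[rowTotals > 0, ]

    # ... and drop the same documents from the corpus, matching by id.
    corpus.clean <- tm_filter(
      crude,
      FUN = function(doc) !(meta(doc, "id") %in% empty.rows)
    )
    ```

    After this, `dtm.clean` and `corpus.clean` should contain the same documents, so the corpus stays aligned with the matrix passed on to topicmodels.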
    
