I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:
<
Just small addendum to the answer of Dario Lacan:
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
will collect record's id, rather than order numbers. Try this:
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm[1, ]$dimnames[1][[1]] # return "127", not "1"
If you construct your own corpus with consecutive numbering, after data cleaning some documents can be removed and numbering also will be broken. So, it's better to use id directly:
corpus <- tm_filter(
corpus,
FUN = function(doc) !is.element(meta(doc)$id, empty.rows))
# !( meta(doc)$id %in% emptyRows )
)