Remove empty documents from DocumentTermMatrix in R topicmodels?

I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:

6 Answers

予麋鹿

2020-11-30 23:25

This is just to elaborate on the answer given by agstudy.

Instead of removing the empty rows from the document-term matrix, we can identify the documents in the corpus that have zero length and remove them directly from the corpus, then build a second DTM from the non-empty documents only.

This is useful to keep a 1:1 correspondence between the dtm and the corpus.

empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
corpus <- corpus[-as.numeric(empty.rows)]

独厮守ぢ

2020-11-30 23:30

"Each row of the input matrix needs to contain at least one non-zero entry"

The error means that the sparse matrix contains a row without any entries (words), i.e. an empty document. One idea is to compute the sum of words in each row and keep only the rows whose total is greater than zero:

rowTotals <- apply(dtm, 1, sum)    # find the sum of words in each document
dtm.new   <- dtm[rowTotals > 0, ]  # remove all docs without words
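
As a minimal sketch of the same idea, the row totals can also be computed with `slam::row_sums()`, which works on the sparse representation directly instead of going through `apply()`. The toy `docs` vector and the names `row_totals` and `dtm_new` are assumptions made for illustration; they are not part of the original answer.

```r
library(tm)

# Hypothetical toy input: the second document contains only words shorter
# than three characters, which DocumentTermMatrix() drops by default.
docs <- c("topic models in R", "of to", "remove empty documents")
dtm  <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Sum of term counts per document, computed on the sparse matrix
row_totals <- slam::row_sums(dtm)

# Keep only the documents that still contain at least one term
dtm_new <- dtm[row_totals > 0, ]
```

`slam` is installed as a dependency of `tm`, so no extra package should be needed.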

梦毁少年i

2020-11-30 23:35
I had a column in a data frame lt$title which contained strings. I had no "empty" rows in this column, but still got the error:

Error in LDA(dtm, k = 20, control = list(seed = 813)) : Each row of the input matrix needs to contain at least one non-zero entry

Some of the solutions above did not work for me, since I needed to join the vector of predicted topics to my original data frame, so removing the empty rows from the document-term matrix was not an option.

The problem was that some (very short) strings in lt$title contained special characters which could not be processed by Corpus() and/or DocumentTermMatrix().

My solution was to remove "short" strings (one or two words max.) which do not carry much information anyway.
```
library(tm)
library(topicmodels)

# Clean up text data: keep only titles with at least 10 characters
lt <- lt[nchar(lt$title) >= 10, ]

# Topic modeling
corpus <- Corpus(VectorSource(lt$title))
dtm <- DocumentTermMatrix(corpus)
lda_model <- LDA(dtm, k = 20, control = list(seed = 813))

# Add "topics" to original DF
lt$topic <- topics(lda_model)
```
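
If dropping short strings is not acceptable, another way to keep the data frame and the predicted topics aligned is to fit the model on the non-empty rows only and write the topics back through a logical index. This is only a sketch under stated assumptions: the toy `lt` data frame, `k = 2`, and the names `keep` and `lda_model` are hypothetical, and `slam::row_sums()` is swapped in to find the empty rows.

```r
library(tm)
library(topicmodels)

# Hypothetical stand-in for the data frame `lt`; row 2 yields an empty document
lt <- data.frame(
  title = c("topic models for text data",
            "of to",
            "remove empty documents first",
            "text data and topic models"),
  stringsAsFactors = FALSE
)

dtm <- DocumentTermMatrix(VCorpus(VectorSource(lt$title)))

# TRUE for documents with at least one term left after preprocessing
keep <- unname(slam::row_sums(dtm) > 0)

# Fit the model on the non-empty documents only
lda_model <- LDA(dtm[keep, ], k = 2, control = list(seed = 813))

# Rows that were left out of the model get NA instead of a topic
lt$topic <- NA_integer_
lt$topic[keep] <- topics(lda_model)
```

The data frame keeps all of its rows, so later joins on the original data still line up.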
感动是毒

2020-11-30 23:43
Just a small addendum to the answer of Dario Lacan:
```
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
```
collects the documents' IDs rather than their row positions. Try this:
```
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm[1, ]$dimnames[1][[1]] # return "127", not "1"
```
If you construct your own corpus with consecutive numbering, some documents may be removed during data cleaning, which breaks the numbering. It is therefore better to use the ID directly:
```
corpus <- tm_filter(
  corpus,
  FUN = function(doc) !is.element(meta(doc)$id, empty.rows)
  # equivalently: !(meta(doc)$id %in% empty.rows)
)
```
说谎

2020-11-30 23:44
agstudy's answer works great, but using it on a slow computer proved mildly problematic.
```
library(tictoc)  # provides tic() and toc()

tic()
row_total <- apply(dtm, 1, sum)
dtm.new <- dtm[row_total > 0, ]
toc()
# 4.859 sec elapsed
```
(this was done with a 4000x15000 dtm)

The bottleneck appears to be applying sum() to a sparse matrix.

A document-term matrix created by the tm package contains the components i and j, which hold the row and column indices of the non-zero entries in the sparse matrix. If dtm$i does not contain a particular row index p, then row p is empty.
```
tic()
ui <- unique(dtm$i)
dtm.new <- dtm[ui, ]
toc()
# 0.121 sec elapsed
```
ui contains the indices of all non-empty rows, and since dtm$i is already ordered, dtm.new will be in the same order as dtm. If you would rather not rely on that ordering, sort(unique(dtm$i)) makes it explicit. The performance gain may not matter for smaller document-term matrices, but may become significant with larger ones.
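
As a small sanity check, the index-based approach can be compared against the row-total approach on a toy matrix. The `docs` vector and the names `by_index` and `by_total` are hypothetical, and `slam::row_sums()` stands in for `apply(dtm, 1, sum)`.

```r
library(tm)

# Hypothetical toy input with one document that ends up empty
docs <- c("topic models in R", "of to", "remove empty documents")
dtm  <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Rows that appear in dtm$i have at least one non-zero entry
ui       <- sort(unique(dtm$i))
by_index <- dtm[ui, ]

# The same selection via row totals, for comparison
by_total <- dtm[slam::row_sums(dtm) > 0, ]

# Both approaches should keep the same documents
same_docs <- identical(Docs(by_index), Docs(by_total))
```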
清歌不尽

2020-11-30 23:46
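Just remove the sparse terms from the DTM. Note that dropping terms can itself leave further documents empty, so the row totals are still worth checking afterwards.
```
# removeSparseTerms() drops terms that are absent from more than the given
# share of documents; 0.99 is only an illustrative threshold
dtm <- removeSparseTerms(DocumentTermMatrix(crude), sparse = 0.99)
```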