Does tm package itself provide a built-in way to combine document-term matrices?

Submitted by 烂漫一生 on 2019-12-06 09:53:38

Question


I generated four document-term matrices from the same corpus, one each for 1-, 2-, 3- and 4-grams. They are all very large (200k x 10k), so converting them to data frames and cbind-ing them is out of the question. I know I could write a program that records the non-zero elements of each matrix and builds a sparse matrix from them, but that is a lot of trouble. It seems natural for the tm package to provide this functionality, so if it does, I don't want to rebuild something that already exists.

If it doesn't, is there a handier way to combine DTMs than writing a program that records the indices of their non-zero elements and then builds a sparse matrix?


Answer 1:


Have you tried tm_combine (see ?tm_combine)? It is available through the generic function c, like so:

library(tm)
data("acq")
data("crude")
# combine two corpora
summary(c(acq, crude))
# combine two single documents
summary(c(acq[[30]], crude[[10]]))
# combine two term-document matrices
c(TermDocumentMatrix(acq), TermDocumentMatrix(crude))
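
Note that c() on term-document matrices seems to be aimed at matrices built from different document sets: as far as I can tell, it merges the terms and appends the documents. A minimal sketch to check this on the two sample corpora (the variable names are only illustrative):

```r
library(tm)
data("acq")
data("crude")

tdm_acq   <- TermDocumentMatrix(acq)
tdm_crude <- TermDocumentMatrix(crude)
combined  <- c(tdm_acq, tdm_crude)

# Documents are appended: the combined matrix should have one column
# per document from either corpus.
nDocs(combined) == nDocs(tdm_acq) + nDocs(tdm_crude)

# Terms are merged rather than appended, so terms shared by both
# corpora should appear only once.
nTerms(combined) <= nTerms(tdm_acq) + nTerms(tdm_crude)
```

For n-gram matrices over the same documents, this is therefore probably not the combination the question is after, since each document would show up once per input matrix.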



Answer 2:


As an example, create two document-term matrices (one for unigrams, one for bigrams) for corpus acq:

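library(tm)
data(acq)

# Bigram tokenizer (requires the tau package): textcnt counts the word
# bigrams in a document; the row names of the resulting data frame are
# the bigrams themselves.
tokenize_bigrams <- function(x) {
  rownames(as.data.frame(unclass(tau::textcnt(x$content, method="string", n=2))))
}

m1 <- DocumentTermMatrix(acq)                                           # unigrams
m2 <- DocumentTermMatrix(acq, control=list(tokenize=tokenize_bigrams))  # bigrams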

dim(m1)
# [1]   50 2103

dim(m2)
# [1]   50 5100

Combine them using cbind. It works because tm relies on package slam, which provides a cbind method for simple triplet matrices:

m <- cbind(m1, m2)

dim(m)
# [1]   50 7203

As expected, the resulting matrix m has 50 rows (for 50 documents in acq) and 7203 columns (2103 for unigrams + 5100 for bigrams).
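
The same column-binding can be seen on toy data using slam directly. The sketch below uses two made-up matrices (the document and term names are invented purely for illustration) with the same number of rows, and binds them without ever creating a dense matrix:

```r
library(slam)

# Two tiny sparse matrices over the same two (hypothetical) documents.
a <- simple_triplet_matrix(i = c(1L, 2L), j = c(1L, 2L), v = c(3, 1),
                           nrow = 2, ncol = 2,
                           dimnames = list(c("d1", "d2"), c("oil", "price")))
b <- simple_triplet_matrix(i = 2L, j = 1L, v = 2,
                           nrow = 2, ncol = 1,
                           dimnames = list(c("d1", "d2"), "oil price"))

# cbind dispatches to slam's method for simple triplet matrices,
# so the result is still stored in sparse triplet form.
ab <- cbind(a, b)

dim(ab)        # rows are unchanged, columns are added up
as.matrix(ab)  # dense view, only sensible for toy data like this
```

Only the non-zero entries are stored in the result, which is what makes this workable for matrices of the size mentioned in the question.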

Note that m is a plain simple triplet matrix:

m
# A 50x7203 simple triplet matrix.

If you want to use it as a document-term matrix, you can do:

attributes(m) <- attributes(m1)

Then:

m
# <<DocumentTermMatrix (documents: 50, terms: 7203)>>
# Non-/sparse entries: 10706/349444
# Sparsity           : 97%
# Maximal term length: 29
# Weighting          : term frequency (tf)
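
As an alternative to copying attributes, tm's as.DocumentTermMatrix() can, as far as I know, coerce a simple triplet matrix when it is given a weighting. The following is a self-contained sketch under that assumption. It swaps in the NLP-based bigram tokenizer from the tm FAQ in place of tau, and BigramTokenizer and dtm are names chosen here, not part of the answer above:

```r
library(tm)   # also attaches NLP, which provides ngrams() and words()
data(acq)

# Bigram tokenizer in the style of the tm FAQ (swapped in for tau::textcnt).
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

m1 <- DocumentTermMatrix(acq)
m2 <- DocumentTermMatrix(acq, control = list(tokenize = BigramTokenizer))

# Bind the columns, then coerce the result back to a document-term matrix,
# stating the weighting explicitly (plain term frequency here).
m   <- cbind(m1, m2)
dtm <- as.DocumentTermMatrix(m, weighting = weightTf)

class(dtm)
dim(dtm)
```

Stating the weighting explicitly avoids relying on m1's attributes happening to fit the combined matrix. The term counts may differ from the ones shown above, because the tokenizer is not the same.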


Source: https://stackoverflow.com/questions/19993504/does-tm-package-itself-provide-a-built-in-way-to-combine-document-term-matrices
