Does the tm package itself provide a built-in way to combine document-term matrices?

Submitted by 风流意气都作罢 on 2019-12-04 17:42:18

Have you tried tm_combine (see ?tm_combine)? tm exposes it through methods for the generic function c, like so:

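library(tm)
data("acq")
data("crude")
# Combine two corpora
summary(c(acq, crude))
# Combine two single documents into a corpus
summary(c(acq[[30]], crude[[10]]))
# Combine two term-document matrices
c(TermDocumentMatrix(acq), TermDocumentMatrix(crude))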

As an example, create two document-term matrices (one for unigrams, one for bigrams) for the acq corpus:

library(tm)
data(acq)

tokenize_bigrams <- function(x) {
  rownames(as.data.frame(unclass(tau::textcnt(x$content, method="string", n=2))))
}

m1 <- DocumentTermMatrix(acq)
m2 <- DocumentTermMatrix(acq, control=list(tokenize=tokenize_bigrams))

dim(m1)
# [1]   50 2103

dim(m2)
# [1]   50 5100

Combine them using cbind. It works because tm relies on package slam, which provides a cbind method for simple triplet matrices:

m <- cbind(m1, m2)

dim(m)
# [1]   50 7203

As expected, the resulting matrix m has 50 rows (for 50 documents in acq) and 7203 columns (2103 for unigrams + 5100 for bigrams).
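
To see the column-binding step in isolation, here is a minimal sketch that uses slam directly, without tm. The two small matrices and their term names are made up for illustration; the only assumption is that both share the same rows (documents), as m1 and m2 do above.

```r
library(slam)

# Two tiny sparse matrices over the same two documents (hypothetical data).
a <- simple_triplet_matrix(
  i = c(1L, 2L), j = c(1L, 2L), v = c(3, 1),
  nrow = 2, ncol = 2,
  dimnames = list(c("d1", "d2"), c("oil", "price"))
)
b <- simple_triplet_matrix(
  i = c(1L, 2L, 2L), j = c(1L, 1L, 3L), v = c(1, 2, 1),
  nrow = 2, ncol = 3,
  dimnames = list(c("d1", "d2"), c("oil price", "crude oil", "price rise"))
)

# cbind dispatches to slam's method for simple triplet matrices:
# rows stay as they are, columns are appended.
ab <- cbind(a, b)
dim(ab)
```

The result keeps the two rows and has 2 + 3 = 5 columns, mirroring the 2103 + 5100 = 7203 columns above.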

Note that m is a plain simple triplet matrix:

m
# A 50x7203 simple triplet matrix.

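If you want to use it as a document-term matrix, you can copy the attributes of m1 (its class and weighting) onto m. The dimensions are unaffected, because they are stored as list elements rather than attributes: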

attributes(m) <- attributes(m1)

Then:

m
# <<DocumentTermMatrix (documents: 50, terms: 7203)>>
# Non-/sparse entries: 10706/349444
# Sparsity           : 97%
# Maximal term length: 29
# Weighting          : term frequency (tf)
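
A possibly more explicit alternative to copying attributes is to coerce the combined matrix with as.DocumentTermMatrix and state the weighting yourself. The sketch below assumes that both inputs use plain term-frequency weighting (hence weightTf) and, to avoid the tau dependency, builds bigrams with NLP::ngrams instead of the tokenizer above; bigram_tokenizer is a name introduced here for illustration.

```r
library(tm)
data(acq)

# Illustrative bigram tokenizer built on NLP (a dependency of tm).
bigram_tokenizer <- function(x) {
  unlist(lapply(NLP::ngrams(NLP::words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

m1 <- DocumentTermMatrix(acq)
m2 <- DocumentTermMatrix(acq, control = list(tokenize = bigram_tokenizer))

# cbind returns a plain simple triplet matrix; coerce it back explicitly.
# Assumption: term-frequency weighting is the right label for both parts.
m <- as.DocumentTermMatrix(cbind(m1, m2), weighting = weightTf)
dim(m)
```

The exact term counts can differ from those shown above, since this tokenizer is not the tau-based one, but the number of columns of m should again be the sum of the columns of m1 and m2.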