The removeCommonTerms function for the tm package is defined as follows:
removeCommonTerms <- function (x, pct)
{
    stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
              is.numeric(pct), pct > 0, pct < 1)
    m <- if (inherits(x, "DocumentTermMatrix"))
        t(x)
    else x
    t <- table(m$i) < m$ncol * (pct)
    termIndex <- as.numeric(names(t[t]))
    if (inherits(x, "DocumentTermMatrix"))
        x[, termIndex]
    else x[termIndex, ]
}
Now I would like to remove overly common terms with the quanteda package. I could do this either before creating the document-feature matrix or on the document-feature matrix itself.
How can I remove overly common terms with the quanteda package in R?
You want the dfm_trim function. From ?dfm_trim:
max_docfreq
maximum number or fraction of documents in which a feature appears, above which features will be removed. (Default is no upper limit.)
This requires the newest version of quanteda (fresh on CRAN).
packageVersion("quanteda")
## [1] ‘0.9.9.3’
inaugdfm <- dfm(data_corpus_inaugural)
dfm_trim(inaugdfm, max_docfreq = .8)
## Removing features occurring:
## - in more than 0.8 * 57 = 45.6 documents: 93
## Total features removed: 93 (1.01%).
## Document-feature matrix of: 57 documents, 9,081 features (92.4% sparse).
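To make the effect of max_docfreq concrete, here is a minimal base R sketch of the same idea on a plain count matrix. It is only an illustration under stated assumptions: the toy counts are made up, trim_common is a hypothetical helper (not part of quanteda or tm), and it keeps features whose document frequency is at most the threshold, whereas the removeCommonTerms function in the question uses a strict "less than" comparison.

```r
# Toy document-feature matrix: rows are documents, columns are features.
# The counts are invented purely for illustration.
m <- matrix(
  c(1, 2, 0, 1,
    3, 0, 0, 1,
    1, 0, 4, 1,
    2, 1, 0, 0,
    1, 0, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("the", "people", "union", "of"))
)

# Hypothetical helper (not a quanteda function): drop features that occur
# in more than max_prop * (number of documents) documents.
trim_common <- function(mat, max_prop) {
  stopifnot(is.matrix(mat), is.numeric(max_prop), max_prop > 0, max_prop <= 1)
  docfreq <- colSums(mat > 0)              # documents containing each feature
  keep <- docfreq <= max_prop * nrow(mat)  # at or below the threshold
  mat[, keep, drop = FALSE]
}

# "the" appears in all 5 documents, which exceeds 0.8 * 5 = 4, so it is dropped.
trimmed <- trim_common(m, 0.8)
colnames(trimmed)
```

Here only "the" is removed, because it is the only feature that occurs in more than 80% of the toy documents; the other three columns are returned unchanged.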
Source: https://stackoverflow.com/questions/41589266/r-removecommonterms-with-quanteda-package