I am working with the 'tm' package in R to create a corpus. I have done most of the preprocessing steps. The remaining thing is to remove overly common words (terms that occur in more than 80% of the documents).
If you are going to use DocumentTermMatrix, then an alternative approach is to use the bounds$global control option. For example:
ndocs <- length(dcs)
# ignore overly sparse terms (appearing in less than 1% of the documents)
minDocFreq <- ndocs * 0.01
# ignore overly common terms (appearing in more than 80% of the documents)
maxDocFreq <- ndocs * 0.8
dtm <- DocumentTermMatrix(dcs, control = list(bounds = list(global = c(minDocFreq, maxDocFreq))))
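For instance, applied to the crude corpus that ships with tm (a minimal sketch; the 1% and 80% cutoffs simply mirror the ones above):

library(tm)
data("crude")

ndocs <- length(crude)
dtm <- DocumentTermMatrix(
    crude,
    control = list(bounds = list(global = c(ndocs * 0.01, ndocs * 0.8)))
)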
What if you made a removeCommonTerms function?
removeCommonTerms <- function (x, pct) {
    stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
              is.numeric(pct), pct > 0, pct < 1)
    # Work on a TermDocumentMatrix so rows are terms and columns are documents
    m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x
    # table(m$i) counts how many documents each term appears in;
    # keep only the terms appearing in fewer than pct of the documents
    t <- table(m$i) < m$ncol * (pct)
    termIndex <- as.numeric(names(t[t]))
    # Subset along the term dimension, preserving the input class
    if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ]
}
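The reason this works is that these matrices are stored in triplet form, so tabulating the row indices of a TermDocumentMatrix gives each term's document frequency directly. A quick sanity check along those lines (a sketch using the crude data, not part of the function itself):

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)

# Rows of a TermDocumentMatrix are terms, so counting the non-zero
# entries per row gives the number of documents each term appears in
docFreq <- table(tdm$i)
head(docFreq)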
Then if you wanted to remove terms that are in >=80% of the documents, you could do
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm
# <<DocumentTermMatrix (documents: 20, terms: 1266)>>
# Non-/sparse entries: 2255/23065
# Sparsity : 91%
# Maximal term length: 17
# Weighting : term frequency (tf)
removeCommonTerms(dtm, .8)
# <<DocumentTermMatrix (documents: 20, terms: 1259)>>
# Non-/sparse entries: 2129/23051
# Sparsity : 92%
# Maximal term length: 17
# Weighting : term frequency (tf)
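Since the function accepts either orientation, the same call also works on a TermDocumentMatrix (a sketch; the result mirrors the one above, just transposed):

tdm <- TermDocumentMatrix(crude)
removeCommonTerms(tdm, .8)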