I am working with the 'tm' package in R to create a corpus. I have done most of the preprocessing steps. The remaining thing is to remove overly common words (terms that occur in more than 80% of the documents).
If you are going to use DocumentTermMatrix, then an alternative approach is to use the bounds$global control option. For example:
ndocs <- length(dcs)
# ignore overly sparse terms (appearing in less than 1% of the documents)
minDocFreq <- ndocs * 0.01
# ignore overly common terms (appearing in more than 80% of the documents)
maxDocFreq <- ndocs * 0.8
dtm <- DocumentTermMatrix(dcs, control = list(bounds = list(global = c(minDocFreq, maxDocFreq))))
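For instance, applied to the crude corpus that ships with tm (a minimal sketch; the 1% and 80% cutoffs simply mirror the ones above):

library(tm)
data("crude")

ndocs <- length(crude)
dtm <- DocumentTermMatrix(
    crude,
    control = list(bounds = list(global = c(ndocs * 0.01, ndocs * 0.8)))
)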
What if you made a removeCommonTerms function?
removeCommonTerms <- function (x, pct) {
    stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
              is.numeric(pct), pct > 0, pct < 1)
    # Work on a TermDocumentMatrix so rows are terms and columns are documents
    m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x
    # table(m$i) counts how many documents each term appears in;
    # keep only the terms appearing in fewer than pct of the documents
    t <- table(m$i) < m$ncol * (pct)
    termIndex <- as.numeric(names(t[t]))
    # Subset along the term dimension, preserving the input class
    if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ]
}
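The reason this works is that these matrices are stored in triplet form, so tabulating the row indices of a TermDocumentMatrix gives each term's document frequency directly. A quick sanity check along those lines (a sketch using the crude data, not part of the function itself):

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)

# Rows of a TermDocumentMatrix are terms, so counting the non-zero
# entries per row gives the number of documents each term appears in
docFreq <- table(tdm$i)
head(docFreq)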
Then if you wanted to remove terms that are in >=80% of the documents, you could do
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm
# <<DocumentTermMatrix (documents: 20, terms: 1266)>>
# Non-/sparse entries: 2255/23065
# Sparsity : 91%
# Maximal term length: 17
# Weighting : term frequency (tf)
removeCommonTerms(dtm, .8)
# <<DocumentTermMatrix (documents: 20, terms: 1259)>>
# Non-/sparse entries: 2129/23051
# Sparsity : 92%
# Maximal term length: 17
# Weighting : term frequency (tf)
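Since the function accepts either orientation, the same call also works on a TermDocumentMatrix (a sketch; the result mirrors the one above, just transposed):

tdm <- TermDocumentMatrix(crude)
removeCommonTerms(tdm, .8)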