Removing overly common words (occur in more than 80% of the documents) in R

Asked 2021-01-01 01:48

I am working with the 'tm' package in R to create a corpus. I have done most of the preprocessing steps. The remaining thing is to remove overly common words (terms that occur in more than 80% of the documents). How can I do this?

2 Answers
  • 2021-01-01 02:17

    If you are going to use DocumentTermMatrix, then an alternative approach is to use the bounds$global control option. For example:

    library(tm)

    ndocs <- length(dcs)
    # ignore overly sparse terms (appearing in less than 1% of the documents)
    minDocFreq <- ndocs * 0.01
    # ignore overly common terms (appearing in more than 80% of the documents)
    maxDocFreq <- ndocs * 0.8
    dtm <- DocumentTermMatrix(dcs, control = list(bounds = list(global = c(minDocFreq, maxDocFreq))))
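    A minimal runnable sketch of the same idea, using the crude corpus that ships with tm (also used in the answer below); the exact term counts will depend on your preprocessing:

    library(tm)
    data("crude")  # 20 example documents bundled with tm

    ndocs <- length(crude)
    dtm <- DocumentTermMatrix(crude,
        control = list(bounds = list(global = c(ndocs * 0.01, ndocs * 0.8))))
    dtm  # terms outside the 1%-80% document-frequency range are gone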
    
  • 2021-01-01 02:24

    What if you made a removeCommonTerms function like this:

    removeCommonTerms <- function (x, pct) 
    {
        stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), 
            is.numeric(pct), pct > 0, pct < 1)
        # work with terms as rows: transpose a DocumentTermMatrix
        m <- if (inherits(x, "DocumentTermMatrix")) 
            t(x)
        else x
        # count the documents each term appears in (non-zero entries per row)
        # and keep only terms appearing in fewer than pct of the documents
        t <- table(m$i) < m$ncol * (pct)
        termIndex <- as.numeric(names(t[t]))
        # subset on the correct dimension for the original class
        if (inherits(x, "DocumentTermMatrix")) 
            x[, termIndex]
        else x[termIndex, ]
    }
    

    Then if you wanted to remove terms that are in >= 80% of the documents, you could do

    data("crude")
    dtm <- DocumentTermMatrix(crude)
    dtm
    # <<DocumentTermMatrix (documents: 20, terms: 1266)>>
    # Non-/sparse entries: 2255/23065
    # Sparsity           : 91%
    # Maximal term length: 17
    # Weighting          : term frequency (tf)
    
    removeCommonTerms(dtm, .8)
    # <<DocumentTermMatrix (documents: 20, terms: 1259)>>
    # Non-/sparse entries: 2129/23051
    # Sparsity           : 92%
    # Maximal term length: 17
    # Weighting          : term frequency (tf)
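
    The helper also accepts a TermDocumentMatrix: the class check allows both classes, and the t(x) branch normalizes the orientation. A quick sketch mirroring the example above:

    tdm <- TermDocumentMatrix(crude)
    dim(tdm)                          # 1266 terms x 20 documents
    dim(removeCommonTerms(tdm, .8))   # should match the result above: 1259 x 20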
    