Removing overly common words (occur in more than 80% of the documents) in R

梦如初夏 2021-01-01 01:48

I am working with the 'tm' package in R to create a corpus. I have done most of the preprocessing steps. The remaining thing is to remove overly common words (terms that occ

2 answers
  •  自闭症患者
    2021-01-01 02:17

    If you are going to use DocumentTermMatrix, then an alternative approach is to use the bounds$global control option. For example:

    library(tm)
    # dsc is the tm corpus built in the earlier preprocessing steps
    ndocs <- length(dsc)
    # ignore overly sparse terms (appearing in less than 1% of the documents)
    minDocFreq <- ndocs * 0.01
    # ignore overly common terms (appearing in more than 80% of the documents)
    maxDocFreq <- ndocs * 0.8
    dtm <- DocumentTermMatrix(dsc, control = list(bounds = list(global = c(minDocFreq, maxDocFreq))))
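
To see what the two bounds do to a vocabulary, here is a minimal base R sketch of the same filtering idea, without tm. The toy documents and the names docs, tokens, doc_freq, kept and dropped are made up for illustration. It assumes the bounds are inclusive, which is my understanding of how bounds$global behaves, so a term is dropped only when it appears in strictly more documents than the upper bound.

```r
# Toy corpus: five short "documents" (hypothetical data, illustration only)
docs <- c(
  "the cat sat on the mat",
  "the dog chased the cat",
  "the bird sang",
  "the fish swam in the pond",
  "the owl slept"
)

# Split on whitespace and keep each term at most once per document
tokens <- lapply(strsplit(tolower(docs), "\\s+"), unique)

# Document frequency: in how many documents each term appears
counts <- table(unlist(tokens))
doc_freq <- setNames(as.integer(counts), names(counts))

ndocs <- length(docs)
min_doc_freq <- ndocs * 0.01  # lower bound, as in the answer above
max_doc_freq <- ndocs * 0.8   # upper bound: 4 of the 5 documents

# Keep terms whose document frequency lies within the bounds (inclusive)
kept <- names(doc_freq)[doc_freq >= min_doc_freq & doc_freq <= max_doc_freq]
dropped <- setdiff(names(doc_freq), kept)

print(dropped)
```

Here "the" occurs in all five documents, which is above the upper bound of four, so it is the only term removed; "cat" occurs in two documents and stays. With a real corpus the DocumentTermMatrix call above does this filtering for you.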
    
