Removing overly common words (occur in more than 80% of the documents) in R

梦如初夏 2021-01-01 01:48

I am working with the 'tm' package in R to create a corpus. I have done most of the preprocessing steps. The remaining step is to remove overly common words (terms that occ

2 Answers

自闭症患者 (OP)

2021-01-01 02:17

If you are going to use DocumentTermMatrix, then an alternative approach is to use the bounds$global control option. For example:

library(tm)

# dcs is assumed to be the corpus built in the earlier preprocessing steps
ndocs <- length(dcs)
# ignore overly sparse terms (appearing in fewer than 1% of the documents)
minDocFreq <- ndocs * 0.01
# ignore overly common terms (appearing in more than 80% of the documents)
maxDocFreq <- ndocs * 0.8
dtm <- DocumentTermMatrix(dcs, control = list(bounds = list(global = c(minDocFreq, maxDocFreq))))
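To see the effect on something small, here is a minimal sketch on a toy corpus. The five sentences and the names `texts` and `toy` are made up for illustration and are not from the question. As far as I know, the bounds are inclusive: a term is kept when the number of documents containing it lies between the lower and upper bound.

```r
library(tm)

# Hypothetical toy data: "the" occurs in all 5 documents, every other word in at most 2
texts <- c("the cat sat", "the dog ran", "the bird flew",
           "the fish swam", "the cat ran")
toy <- VCorpus(VectorSource(texts))

ndocs <- length(toy)
# keep terms that appear in at least 1 document ...
minDocFreq <- 1
# ... and in at most 80% of the documents (4 out of 5 here)
maxDocFreq <- ndocs * 0.8

dtm <- DocumentTermMatrix(
  toy,
  control = list(bounds = list(global = c(minDocFreq, maxDocFreq)))
)

# Inspect which terms survived the document-frequency filter
print(Terms(dtm))
```

With these settings "the" should be dropped, because it appears in 5 documents and the upper bound is 4, while words such as "cat" and "ran" stay. On a real corpus it is worth checking `Terms(dtm)` or `dim(dtm)` before and after to confirm the filter removes what you expect.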
