R: removeCommonTerms with Quanteda package?

Posted by 拟墨画扇 on 2019-12-08 06:01:32

Question


The removeCommonTerms function, written for the tm package, is defined as follows:

removeCommonTerms <- function (x, pct) 
{
    stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), 
        is.numeric(pct), pct > 0, pct < 1)
    m <- if (inherits(x, "DocumentTermMatrix")) 
        t(x)
    else x
    t <- table(m$i) < m$ncol * (pct)
    termIndex <- as.numeric(names(t[t]))
    if (inherits(x, "DocumentTermMatrix")) 
        x[, termIndex]
    else x[termIndex, ]
}

Now I would like to remove overly common terms with the quanteda package. I could do this either before creating the document-feature matrix or on the document-feature matrix itself.

How can I remove overly common terms with the quanteda package in R?


Answer 1:


You want the dfm_trim function. From ?dfm_trim:

max_docfreq maximum number or fraction of documents in which a feature appears, above which features will be removed. (Default is no upper limit.)
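To make the fraction form of that threshold concrete, here is a minimal base R sketch of the same rule applied to a plain count matrix. It is only an illustration of the idea, not quanteda's implementation; the toy matrix and the names counts, doc_freq, max_prop and trimmed are made up for the example.

```r
# Toy document-feature counts: 5 documents (rows) x 4 features (columns).
# All values and names here are hypothetical.
counts <- matrix(
  c(1, 2, 0, 0,
    3, 0, 1, 0,
    1, 0, 0, 2,
    2, 1, 0, 0,
    1, 0, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("the", "people", "union", "peace"))
)

# Document frequency: the number of documents in which each feature occurs.
doc_freq <- colSums(counts > 0)

# Keep only features that occur in at most 80% of the documents.
max_prop <- 0.8
keep <- doc_freq <= max_prop * nrow(counts)
trimmed <- counts[, keep, drop = FALSE]

colnames(trimmed)
```

Here "the" occurs in all 5 documents, which is more than 0.8 * 5 = 4, so it is dropped, while the other three features are kept.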

Using max_docfreq this way requires the newest version of quanteda (fresh on CRAN).

packageVersion("quanteda")
## [1] ‘0.9.9.3’

inaugdfm <- dfm(data_corpus_inaugural)

dfm_trim(inaugdfm, max_docfreq = .8)
## Removing features occurring: 
##   - in more than 0.8 * 57 = 45.6 documents: 93
##   Total features removed: 93 (1.01%).
## Document-feature matrix of: 57 documents, 9,081 features (92.4% sparse).
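The call above is for quanteda 0.9.9.3. As a hedged sketch for later releases: to my knowledge, newer versions of quanteda expect dfm() to be given a tokens object rather than a corpus, and dfm_trim() treats max_docfreq as a document count unless docfreq_type = "prop" is passed. Both points are assumptions about whichever version you have installed, so check ?dfm_trim there. The name trimmed is just for illustration.

```r
library(quanteda)

# Tokenize first; this assumes a release where dfm() is built from tokens.
toks <- tokens(data_corpus_inaugural)
inaugdfm <- dfm(toks)

# Interpret max_docfreq as a proportion of documents rather than a count
# (assumes the installed dfm_trim() has the docfreq_type argument).
trimmed <- dfm_trim(inaugdfm, max_docfreq = 0.8, docfreq_type = "prop")

# Compare the number of features before and after trimming.
nfeat(inaugdfm)
nfeat(trimmed)
```

The exact feature counts depend on the quanteda version and on the version of data_corpus_inaugural that ships with it, so they will not necessarily match the output shown above.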


Source: https://stackoverflow.com/questions/41589266/r-removecommonterms-with-quanteda-package
