R: remove sparse terms by type of document

老子叫甜甜 submitted on 2019-12-13 02:36:31

Question


I'm new to working with corpora. I have a large corpus containing 4 types of documents, and I want to remove sparse terms within each type. I can't simply build a separate corpus per type, because the documents have already gone through many transformations. Following a post I found, I created a TermDocumentMatrix with the type name in each column, but I can't find a way to remove sparse terms by type. Any ideas? Thank you!

Just as an example, here I removed sparse terms across the whole corpus:

TDM_1 <- removeSparseTerms(TDM, 0.98)
inspect(TDM_1)

<<TermDocumentMatrix (terms: 27, documents: 2583)>>
Non-/sparse entries: 3591/66150
Sparsity           : 95%
Maximal term length: 12
Weighting          : term frequency (tf)

TDM_1$dimnames (the document types within which sparse terms should be removed)
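One possible way to filter per type from a single matrix is to split the document columns by type and call removeSparseTerms on each slice. The following is only a minimal sketch: doc_type is a hypothetical vector with one type label per document column, and the crude data set with two made-up labels stands in for the real corpus.

```r
library(tm)

data(crude)
tdm <- TermDocumentMatrix(crude)

# Hypothetical type labels, one per document column (an assumption for
# illustration; in the real data these would come from the column names).
doc_type <- rep(c("type_a", "type_b"), length.out = nDocs(tdm))

# Group the column indices by type, then filter each sub-matrix separately.
cols_by_type <- split(seq_len(nDocs(tdm)), doc_type)
tdm_by_type <- lapply(cols_by_type, function(cols) {
  removeSparseTerms(tdm[, cols], 0.98)
})

# One filtered TermDocumentMatrix per type
names(tdm_by_type)
```

Each element of tdm_by_type keeps only the terms that are not sparse within its own type, so the term sets can differ between types.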

EDIT: Thanks for the comments. I realized that my corpus was wrong, so I changed the transformer functions and created one TermDocumentMatrix per type. But now I have another problem removing sparse terms. Suppose my TDMs are tdm_1 and tdm_2.

library(tm)
library(Rstem)

data(crude)

spl <- runif(length(crude)) < 0.7
crude_1 <- crude[spl]
crude_2 <- crude[!spl]

controls <- list(
  tolower = TRUE,
  removePunctuation = TRUE,
  stopwords = stopwords("english"),
  stemming = function(word) wordStem(word, language = "english")
)

tdm_1 <- TermDocumentMatrix(crude_1, controls)
tdm_2 <- TermDocumentMatrix(crude_2, controls)

## This doesn't work.

for(i in 1:2){
  assign(paste0("TDM_", i), 
     removeSparseTerms(paste0('tdm_', i), 0.98))
}

## But this is ok.

removeSparseTerms(tdm_1, 0.98)

Thanks again!
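A likely reason the loop fails is that paste0('tdm_', i) returns a character string such as "tdm_1", not the matrix stored under that name, so removeSparseTerms receives text instead of a TermDocumentMatrix. A minimal sketch of two possible fixes follows, using a deterministic split of crude and default controls for brevity (both are simplifications, not the original setup): look the object up with get(), or keep the matrices in a list and use lapply.

```r
library(tm)

data(crude)

# Deterministic split for illustration (the original post uses runif()).
spl <- seq_along(crude) %% 2 == 0
tdm_1 <- TermDocumentMatrix(crude[spl])
tdm_2 <- TermDocumentMatrix(crude[!spl])

# Option 1: get() turns the name string into the object it refers to.
for (i in 1:2) {
  assign(paste0("TDM_", i),
         removeSparseTerms(get(paste0("tdm_", i)), 0.98))
}

# Option 2: store the matrices in a named list and avoid assign()/get().
tdms <- list(tdm_1 = tdm_1, tdm_2 = tdm_2)
TDMs <- lapply(tdms, removeSparseTerms, sparse = 0.98)
```

The list version tends to scale better to more document types, since results are accessed as TDMs$tdm_1 rather than through generated variable names.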

Source: https://stackoverflow.com/questions/32870373/r-remove-sparse-terms-by-type-of-documents
