Question
I'm new to working with corpora. I have a large corpus containing 4 types of documents, and I want to remove sparse terms within each type. I can't just create separate corpora because the documents had a lot of transformations applied beforehand. Following some posts, I created a TermDocumentMatrix with the name of the type in each column, but I can't find a way to remove sparse terms by type. Any ideas? Thank you!
Just as an example, here I removed sparse terms for the whole corpus:
TDM_1 <- removeSparseTerms(TDM, 0.98)
inspect(TDM_1)
<<TermDocumentMatrix (terms: 27, documents: 2583)>>
Non-/sparse entries: 3591/66150
Sparsity           : 95%
Maximal term length: 12
Weighting          : term frequency (tf)
TDM_1$dimnames  # holds the document types within which sparse terms should be removed
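One way the per-type idea could be sketched on a single matrix is to split the column indices by type and call removeSparseTerms on each column subset. This is only an illustration under assumptions: the `crude` data set stands in for the real corpus, and `doc_type` is a hypothetical vector of type labels (with type names as column names, `colnames(tdm)` could play that role).

```r
library(tm)
data(crude)

# Stand-in matrix built from the crude data set (20 documents).
tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE))

# Hypothetical type labels, one per document (column); replace with the real ones.
doc_type <- rep(c("type_a", "type_b"), length.out = nDocs(tdm))

# Split the column indices by type, then prune each column subset on its own,
# so sparsity is measured against the documents of that type only.
cols_by_type <- split(seq_len(nDocs(tdm)), doc_type)
tdm_by_type <- lapply(cols_by_type, function(cols) {
  removeSparseTerms(tdm[, cols], 0.98)
})
```

The result is a named list with one pruned TermDocumentMatrix per type, each of which may keep a different set of terms.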
EDIT: Thanks for the comments. I realized that my corpus was wrong. I changed the transformer functions and created one TermDocumentMatrix per type. But now I have another problem removing sparse terms. Suppose my TDMs are tdm_1 and tdm_2.
library(tm)
library(Rstem)
data(crude)
spl <- runif(length(crude)) < 0.7
crude_1 <- crude[spl]
crude_2 <- crude[!spl]
controls <- list(
  tolower = TRUE,
  removePunctuation = TRUE,
  stopwords = stopwords("english"),
  stemming = function(word) wordStem(word, language = "english")
)
tdm_1 <- TermDocumentMatrix(crude_1, controls)
tdm_2 <- TermDocumentMatrix(crude_2, controls)
## Doesn't work.
for (i in 1:2) {
  assign(paste0("TDM_", i),
         removeSparseTerms(paste0('tdm_', i), 0.98))
}
## But this is ok.
removeSparseTerms(tdm_1, 0.98)
Thanks again!
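A likely reason the loop fails is that paste0() returns the character string "tdm_1" rather than the object named tdm_1, and removeSparseTerms expects a TermDocumentMatrix. A minimal sketch of two possible workarounds follows. It assumes a fixed split of `crude` in place of the random one and drops the stemming controls, and `tdms` and `pruned` are names introduced here only for illustration.

```r
library(tm)
data(crude)

# Two stand-in matrices; a fixed split replaces the random one for reproducibility.
tdm_1 <- TermDocumentMatrix(crude[1:14])
tdm_2 <- TermDocumentMatrix(crude[15:20])

# Option 1: look the object up by name with get() before passing it on.
for (i in 1:2) {
  assign(paste0("TDM_", i),
         removeSparseTerms(get(paste0("tdm_", i)), 0.98))
}

# Option 2: keep the matrices in a named list and avoid assign()/get() entirely.
tdms <- list(tdm_1 = tdm_1, tdm_2 = tdm_2)
pruned <- lapply(tdms, removeSparseTerms, sparse = 0.98)
```

The list version tends to scale better when there are more than a couple of document types, since the results stay together in one object.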
Source: https://stackoverflow.com/questions/32870373/r-remove-sparse-terms-by-type-of-documents