How to read and write a TermDocumentMatrix in R?


Question


I made a word cloud from a CSV file in R, using the TermDocumentMatrix function from the tm package. Here is my code:

library(tm)
library(KoNLP)    # provides extractNoun; useSejongDic() can be called here if needed

csvData <- read.csv("word", encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(csvData$content) <- "UTF-8"

# extract nouns from every document (this is the slow step)
nouns <- sapply(csvData$content, extractNoun, USE.NAMES = FALSE)

# create corpus
myCorpus <- Corpus(VectorSource(nouns))

# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove stop words (myStopwords is defined elsewhere)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

# create the term-document matrix, keeping only terms of 2 to 5 characters
TDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(2, 5)))

m <- as.matrix(TDM)

This process takes too much time, and I believe extractNoun accounts for most of it. To avoid re-running it, I want to save the resulting TDM to a file. When I read that saved file back, can I simply use m <- as.matrix(saved TDM file)? Or is there a better alternative?
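In other words, the intended workflow would look roughly like the sketch below. It uses base R's saveRDS()/readRDS(), which is one common way to persist a single R object; the file name is arbitrary, and library(tm) should be loaded in the new session so the matrix methods for the TermDocumentMatrix class are available.

library(tm)

# after the slow pipeline has run once, persist the TDM
saveRDS(TDM, file = "tdm.rds")

# later, in a fresh session: reload and use it directly
TDM <- readRDS("tdm.rds")
m <- as.matrix(TDM)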


Answer 1:


I'm not an expert, but I've done some NLP work.

I use parSapply from the parallel package. The documentation is here: http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf

parallel ships with base R; here is a toy example of how to use it:

library(parallel)

no_cores <- detectCores() - 1        # leave one core free for the rest of the system
cl <- makeCluster(no_cores)

base <- 2
clusterExport(cl, "base")            # 'base' must be defined before it is exported

parSapply(cl, as.character(2:4),
          function(exponent) {
            x <- as.numeric(exponent)
            c(base = base^x, self = x^x)
          })

stopCluster(cl)

So, parallelize nouns <- sapply(csvData$content, extractNoun, USE.NAMES = F) and it will be faster :)
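Applied to that call, it could look roughly like the following sketch. It assumes csvData is already loaded in the main session and that KoNLP (including any custom dictionary, e.g. useSejongDic()) is set up on every worker.

library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, {
  library(KoNLP)    # each worker needs KoNLP loaded
  useSejongDic()    # and the same dictionary as the main session
})

nouns <- parSapply(cl, csvData$content, extractNoun, USE.NAMES = FALSE)

stopCluster(cl)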




Answer 2:


I noticed that you call several functions from the tm package, and these can also easily be parallelized. For tm this functionality was added in March 2017, a month after your question was posted.

The new-features section of the release notes for tm version 0.7 (2017-03-02) states:

tm_parLapply() is now internally used for the parallelization of transformations, filters, and term-document matrix construction. The preferred parallelization engine can be registered via tm_parLapply_engine(). The default is to use no parallelization (instead of mclapply (package parallel) in previous versions).

To set up parallelization for the tm commands, the following has worked for me:

library(tm)
library(parallel)

cores <- detectCores()
cl <- makeCluster(cores)    # use cores - 1 if you want to do anything else on the PC
tm_parLapply_engine(cl)

## insert your commands for creating the corpus,
## the tm_map transformations and TermDocumentMatrix here

tm_parLapply_engine(NULL)   # revert to serial execution
stopCluster(cl)
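Filled in with the commands from the question, the whole block would read roughly as follows (a sketch; nouns and myStopwords are as defined in the question):

library(tm)
library(parallel)

cl <- makeCluster(detectCores() - 1)
tm_parLapply_engine(cl)

myCorpus <- Corpus(VectorSource(nouns))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
TDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(2, 5)))

tm_parLapply_engine(NULL)
stopCluster(cl)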

If you have a function that you apply through a tm_map content transformer, you will need to use clusterExport to pass that function to the parallelized environment before the tm_map(MyCorpus, content_transformer(clean)) command, e.g. passing my clean function to the environment:

clusterExport(cl, "clean") 
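Put together, the pattern looks like this minimal sketch (the clean function here is a hypothetical placeholder; substitute your own transformer):

clean <- function(x) gsub("[[:cntrl:]]+", " ", x)   # hypothetical cleaning function

clusterExport(cl, "clean")                          # make 'clean' visible on every worker
myCorpus <- tm_map(myCorpus, content_transformer(clean))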

One last comment: keep an eye on your memory usage. If your computer starts paging memory out to disk, the CPU is no longer the critical path, and all the parallelization won't make any difference.



Source: https://stackoverflow.com/questions/42103676/how-to-read-and-write-termdocumentmatrix-in-r
