Why does as.matrix result in memory overload while running text mining in R?


Question


I am doing a text analysis with R package tm.

My code is based on this link: https://www.r-bloggers.com/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know/

The text files I load are only 4,800 kB in total. They are a 10% sample of the original files I want to analyze.

My code is:

library(tm)
library(wordcloud)
library(SnowballC)
library(textmineR)
library(RWeka)

blogssub <- readLines("10kblogs.txt")
newssub <- readLines("10knews.txt")
tweetssub <- readLines("10ktwitter.txt")

corpussubset <- c(blogssub,newssub,tweetssub)
cpsub <- corpussubset

cpsubclean <- VCorpus(VectorSource(cpsub))

# tokenizer for unigrams
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))

options(mc.cores = 1) # hangs if you don't include this option on Mac OS

tdmuni <- TermDocumentMatrix(cpsubclean, control = list(tokenize = unigram))

m <- as.matrix(tdmuni)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

The code gives the following error: "Cannot allocate vector of size 12.3 Gb"

The error is caused by this line: m <- as.matrix(tdmuni)
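As a rough sanity check (assuming as.matrix() builds a fully dense double matrix, i.e. nTerms × nDocs × 8 bytes regardless of how sparse the data is), the expected allocation can be computed from the matrix dimensions with tm's nTerms() and nDocs():

# expected size in GiB of the dense matrix that as.matrix() would create
nTerms(tdmuni) * nDocs(tdmuni) * 8 / 1024^3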

Can it be the case that my code is inefficient in some way? I am surprised that such a huge vector of 12.3 Gb is allocated, since the original text files are only 4,800 kB.
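For what it's worth, here is a minimal sketch of the workaround I am considering, which sums term frequencies directly on the sparse representation (tm stores the TermDocumentMatrix as a slam simple_triplet_matrix) instead of densifying it with as.matrix():

library(slam)

# row sums on the sparse matrix; no dense nTerms x nDocs matrix is allocated
v <- sort(slam::row_sums(tdmuni), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)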

Thanks a lot!

Source: https://stackoverflow.com/questions/50890935/why-does-as-matrix-result-in-memory-overload-while-running-text-mining-in-r
