Text Mining in R | memory management


@Vineet here is the math that shows why R tried to allocate 603 GB to convert the document term matrix to a non-sparse (dense) matrix. Each numeric cell in a dense R matrix consumes 8 bytes. Based on the size of the document term matrix in the question, the math looks like this:

> #
> # calculate memory consumed by a dense matrix
> #
>
> rows <- 472029    # rows in the document term matrix
> cols <- 171548    # columns in the document term matrix
> # memory in gigabytes: 8 bytes per numeric cell
> rows * cols * 8 / (1024 * 1024 * 1024)
[1] 603.3155
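
As an aside, if the document term matrix was built with the tm package, it is already stored as a sparse slam simple_triplet_matrix, so term totals can be computed without ever densifying it. A minimal sketch, assuming the matrix is in a variable named dtm (a hypothetical name):

library(slam)
# column sums of the sparse matrix give the total count for each term
term_freq <- col_sums(dtm)
head(sort(term_freq, decreasing = TRUE), 20)   # 20 most frequent terms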

If you want to calculate the word frequencies, you're better off generating 1-grams and then summarizing them into a frequency distribution.

With the quanteda package, the code would look like this (note that tokenize() has since been renamed tokens()):

words <- tokens(...)                        # tokenize the input text
ngram1 <- unlist(tokens_ngrams(words, n = 1))
ngram1freq <- data.frame(table(ngram1))     # counts per unique word
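
Since table() returns counts in alphabetical order, you may want to sort the result by count; a minimal sketch (Freq is the column name that data.frame(table(...)) creates by default):

# reorder the frequency table by descending count
ngram1freq <- ngram1freq[order(-ngram1freq$Freq), ]
head(ngram1freq, 20)   # 20 most frequent words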

regards,

Len

2017-11-24 UPDATE: Here is a complete example using the quanteda package that generates the frequency distribution from a document feature matrix with the textstat_frequency() function, along with a barplot() of the top 20 features.

This approach does not require generating and aggregating n-grams into a frequency distribution.

library(quanteda)
# built-in character vector of UK 2010 immigration manifestos
myCorpus <- corpus(data_char_ukimmig2010)
# document feature matrix: lowercase, remove stopwords and punctuation
system.time(theDFM <- dfm(myCorpus, tolower = TRUE,
                          remove = c(stopwords(), ",", ".", "-", "\"", "'", "(", ")", ";", ":")))
# frequency of each feature across all documents
system.time(textFreq <- textstat_frequency(theDFM))

# histogram of the overall word-frequency distribution
hist(textFreq$frequency,
     main = "Frequency Distribution of Words: UK 2010 Election Manifestos")

# textstat_frequency() sorts features by descending frequency,
# so the first 20 rows are the top 20 features
top20 <- textFreq[1:20, ]
barplot(height = top20$frequency,
        names.arg = top20$feature,
        horiz = FALSE,
        las = 2,
        main = "Top 20 Words: UK 2010 Election Manifestos")

...and the resulting barplot (not reproduced here) shows the top 20 words in descending order of frequency.
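
For a quick look at the same information without a plot, quanteda also provides topfeatures(), which returns the top n features as a named vector; a one-liner using the theDFM object built above:

topfeatures(theDFM, 20)   # 20 most frequent features with their counts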
