Text Mining in R | memory management

情深已故 2020-12-06 23:49

I am using a 160 MB text file for data mining, but once I convert it to a matrix to get the word frequencies, it demands too much memory. Can someone…

1 Answer
  • 2020-12-07 00:51

    @Vineet, here is the math that shows why R tried to allocate 603 GB to convert the document-term matrix to a non-sparse (dense) matrix. Each numeric cell in an R matrix consumes 8 bytes. Based on the dimensions of the document-term matrix in the question, the math looks like this:

    > # calculate memory consumed by a dense numeric matrix
    > rows <- 472029
    > cols <- 171548
    > # memory in gigabytes (8 bytes per cell, 1024^3 bytes per gigabyte)
    > rows * cols * 8 / (1024 * 1024 * 1024)
    [1] 603.3155
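
To put that figure in context, the same arithmetic can be wrapped in a function and compared with a rough estimate for compressed sparse column storage, the kind of layout sparse matrix classes typically use. This is a minimal sketch in base R. The function names `dense_gib()` and `sparse_gib()` are made up for illustration, and the 0.01% density is an assumption, since the question does not say how many cells are non-zero.

```r
# Memory needed by a dense numeric (double) matrix, in GiB.
dense_gib <- function(rows, cols, bytes_per_cell = 8) {
  rows * cols * bytes_per_cell / 1024^3
}

# Rough estimate for compressed sparse column storage, in GiB:
# each non-zero cell stores an 8-byte value and a 4-byte row index,
# and each column stores a 4-byte pointer (plus one extra pointer).
sparse_gib <- function(nonzero, cols) {
  (nonzero * (8 + 4) + (cols + 1) * 4) / 1024^3
}

rows <- 472029
cols <- 171548
dense <- dense_gib(rows, cols)
print(dense)

# ASSUMPTION: 0.01% of the cells are non-zero (hypothetical density,
# not taken from the question).
nonzero <- rows * cols * 1e-4
sparse <- sparse_gib(nonzero, cols)
print(sparse)
```

Under that assumed density the sparse estimate stays well below 1 GiB, which is why keeping the matrix sparse is usually the practical choice.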
    

    If you want to calculate the word frequencies, you're better off generating 1-grams and then summarizing them into a frequency distribution.

    With the quanteda package the code would look like this.

    library(quanteda)
    words <- tokens(...)  # tokens() replaces the older tokenize()
    ngram1 <- unlist(as.list(tokens_ngrams(words, n = 1)))
    ngram1freq <- data.frame(table(ngram1))
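
If even the token vector is too large to hold at once, the same idea of summarizing 1-grams into a frequency distribution can be applied chunk by chunk. The following is a minimal sketch in base R, not quanteda. The function name `count_words()`, the chunk size, and the tokenization rule (lowercase, split on anything other than a-z and apostrophes) are assumptions for illustration, and the rule drops non-ASCII words.

```r
# Accumulate word counts from a text file chunk by chunk,
# without ever building a document-term matrix.
count_words <- function(path, chunk_lines = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  totals <- integer(0)  # named vector: word -> running count
  repeat {
    lines <- readLines(con, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0) break
    # Simplistic tokenizer (assumption): lowercase, split on non-letters
    words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
    words <- words[nzchar(words)]
    if (length(words) == 0) next
    chunk <- table(words)
    # Merge this chunk's counts into the running totals
    common <- intersect(names(chunk), names(totals))
    fresh <- setdiff(names(chunk), names(totals))
    totals[common] <- totals[common] + as.integer(chunk[common])
    totals[fresh] <- as.integer(chunk[fresh])
  }
  sort(totals, decreasing = TRUE)
}

# Usage on a tiny temporary file, read two lines at a time
tmp <- tempfile(fileext = ".txt")
writeLines(c("the cat sat", "the dog sat", "The end"), tmp)
freq <- count_words(tmp, chunk_lines = 2L)
print(head(freq, 3))
```

Memory use here grows with the size of the vocabulary rather than with the number of documents times the number of terms.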
    

    regards,

    Len

    2017-11-24 UPDATE: Here is a complete example from the quanteda package that generates the frequency distribution from a document feature matrix using the textstat_frequency() function, as well as a barplot() for the top 20 features.

    This approach does not require generating n-grams and aggregating them into a frequency distribution.

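    library(quanteda)
    library(quanteda.textstats)  # textstat_frequency() lives here as of quanteda 3.0

    myCorpus <- corpus(data_char_ukimmig2010)

    # As of quanteda 3.0, passing a corpus or remove= to dfm() is deprecated:
    # tokenize first, drop stopwords and punctuation, then build the dfm.
    theTokens <- tokens_remove(tokens(myCorpus),
                               c(stopwords(), ",", ".", "-", "\"", "'", "(", ")", ";", ":"))
    system.time(theDFM <- dfm(theTokens, tolower = TRUE))
    system.time(textFreq <- textstat_frequency(theDFM))

    # distribution of feature frequencies
    hist(textFreq$frequency,
         main = "Frequency Distribution of Words: UK 2010 Election Manifestos")

    # textstat_frequency() returns features sorted by descending frequency
    top20 <- head(textFreq, 20)
    barplot(height = top20$frequency,
            names.arg = top20$feature,
            horiz = FALSE,
            las = 2,
            main = "Top 20 Words: UK 2010 Election Manifestos")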
    

    ...and the resulting barplot: [figure: "Top 20 Words: UK 2010 Election Manifestos"]
