Working with large text files in R to create n-grams

走远了吗 · Submitted on 2019-12-22 01:21:46

Question


I am trying to create bigrams and trigrams from a large (1 GB) text file using the 'quanteda' package in R. If I run my code in one go (as below), R just hangs on the third line (myCorpus <- toLower(...)). The same code ran successfully on a small (<1 MB) dataset, so I suspect the file is simply too large. I can see that I probably need to load the text in 'chunks' and combine the resulting bigram and trigram frequencies afterwards, but I cannot work out how to load and process the text in manageable chunks. Any advice on an approach to this problem would be very welcome, as would suggestions for otherwise improving my code, which is pasted below.

library(quanteda)

folder.dataset.english <- 'final/corpus'

# Build the corpus from all .txt files in the folder
myCorpus <- corpus(x = textfile(list.files(path = folder.dataset.english,
                                           pattern = "\\.txt$",
                                           full.names = TRUE,
                                           recursive = FALSE)))
myCorpus <- toLower(myCorpus, keepAcronyms = TRUE)

# bigrams
bigrams <- dfm(myCorpus, ngrams = 2, verbose = TRUE, toLower = TRUE,
               removeNumbers = TRUE, removePunct = TRUE,
               removeSeparators = TRUE, removeTwitter = TRUE, stem = FALSE)
bigrams_freq <- sort(colSums(bigrams), decreasing = TRUE)
bigrams <- data.frame(names = names(bigrams_freq), freq = bigrams_freq,
                      stringsAsFactors = FALSE)
bigrams$first <- sapply(strsplit(bigrams$names, "_"), "[[", 1)
bigrams$last  <- sapply(strsplit(bigrams$names, "_"), "[[", 2)
rownames(bigrams) <- NULL
bigrams.freq.freq <- table(bigrams$freq)
saveRDS(bigrams, "dictionaries/bigrams.rds")

# trigrams
trigrams <- dfm(myCorpus, ngrams = 3, verbose = TRUE, toLower = TRUE,
                removeNumbers = TRUE, removePunct = TRUE,
                removeSeparators = TRUE, removeTwitter = TRUE, stem = FALSE)
trigrams_freq <- sort(colSums(trigrams), decreasing = TRUE)
trigrams <- data.frame(names = names(trigrams_freq), freq = trigrams_freq,
                       stringsAsFactors = FALSE)
trigrams$first <- paste(sapply(strsplit(trigrams$names, "_"), "[[", 1),
                        sapply(strsplit(trigrams$names, "_"), "[[", 2),
                        sep = "_")
trigrams$last <- sapply(strsplit(trigrams$names, "_"), "[[", 3)
rownames(trigrams) <- NULL
saveRDS(trigrams, "dictionaries/trigrams.rds")
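
Note: the textfile(), toLower(), and ngrams-inside-dfm() calls above belong to the older quanteda interface. On later quanteda releases, the equivalent bigram count would look roughly like the following sketch (the names toks and bigram_dfm are illustrative, not from the original code):

library(quanteda)

# Rough equivalent under the newer quanteda API, where tokens() /
# tokens_ngrams() replaced the ngrams and remove* arguments to dfm().
toks <- tokens(myCorpus,
               remove_numbers = TRUE,
               remove_punct = TRUE,
               remove_separators = TRUE)
toks <- tokens_tolower(toks, keep_acronyms = TRUE)
bigram_dfm <- dfm(tokens_ngrams(toks, n = 2))
bigrams_freq <- sort(colSums(bigram_dfm), decreasing = TRUE)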

Answer 1:


After much headache, I kind of solved this myself in a very brute-force way, which I am slightly embarrassed about, but I will show it anyway! I am sure there are more elegant and efficient approaches (please feel free to educate me), but since I only need to process this text once, the inelegant solution matters less.

I converted the text to a 'tm' package VCorpus object, which consisted of three large text files, then iterated through the three files, manually slicing each one up and processing one chunk at a time. For clarity I have not plumbed in the processing code given above; I have only indicated where it needs to be stitched in. I still need to add some code to accumulate the results from each chunk; a sketch of that accumulation step follows the loop below.

library(tm)

folder.dataset.english <- 'final/corpus'
corpus <- VCorpus(DirSource(directory = folder.dataset.english,
                            encoding = "UTF-8", recursive = FALSE),
                  readerControl = list(language = "en"))
chunk.size <- 100000

for (t in seq_along(corpus)) {           # one document per text file
  corp.size <- length(corpus[[t]]$content)
  # slice the document into chunks of at most chunk.size lines;
  # min() keeps the last chunk inside the document even when
  # corp.size is an exact multiple of chunk.size
  for (l in seq(1, corp.size, by = chunk.size)) {
    h <- min(l + chunk.size - 1, corp.size)
    corpus.chunk <- corpus[[t]]$content[l:h]
    #### Processing code goes in here

    #### Processing code ends here
  }
}
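
To finish the job, the per-chunk results still have to be merged. A minimal sketch of that accumulation step, assuming each chunk is reduced to a named frequency vector the way colSums(bigrams) does in the question (total_freq, chunk_freq, and accumulate_freq are hypothetical names, not part of tm or quanteda):

# Running total of n-gram counts across all chunks
total_freq <- numeric(0)

# Hypothetical helper: add one chunk's named frequency vector to the total,
# summing counts for n-grams that appear in more than one chunk
accumulate_freq <- function(total, chunk_freq) {
  all_names <- union(names(total), names(chunk_freq))
  out <- setNames(numeric(length(all_names)), all_names)
  out[names(total)] <- total
  out[names(chunk_freq)] <- out[names(chunk_freq)] + chunk_freq
  out
}

# Inside the chunk loop, after building a dfm from corpus.chunk:
#   chunk_freq <- colSums(chunk_dfm)
#   total_freq <- accumulate_freq(total_freq, chunk_freq)

After the loop, total_freq plays the role that bigrams_freq (or trigrams_freq) plays in the question's code, and the same data.frame/strsplit post-processing can be applied to it.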


Source: https://stackoverflow.com/questions/41552345/working-with-large-text-files-in-r-to-create-n-grams
