Working with large text files in R to create n-grams

走远了吗 · Submitted on 2019-12-22 01:21:46

Question


I am trying to create bigrams and trigrams from a large (1 GB) text file using the 'quanteda' package in R. If I run my code in one go (as below), R just hangs on the third line (myCorpus <- toLower(...)). The same code ran successfully on a small (<1 MB) dataset, so I suspect the file is simply too large. I can see that I probably need to load the text in 'chunks' and combine the resulting bigram and trigram frequencies afterwards, but I cannot work out how to load and process the text in manageable chunks. Any advice on an approach to this problem would be very welcome, as would suggestions for otherwise improving my code, which is pasted below.

library(quanteda)

folder.dataset.english <- 'final/corpus'

# Build the corpus from all .txt files in the folder
myCorpus <- corpus(x = textfile(list.files(path = folder.dataset.english,
                                           pattern = "\\.txt$",
                                           full.names = TRUE,
                                           recursive = FALSE)))
myCorpus <- toLower(myCorpus, keepAcronyms = TRUE)

# bigrams
bigrams <- dfm(myCorpus, ngrams = 2, verbose = TRUE, toLower = TRUE,
               removeNumbers = TRUE, removePunct = TRUE,
               removeSeparators = TRUE, removeTwitter = TRUE, stem = FALSE)
bigrams_freq <- sort(colSums(bigrams), decreasing = TRUE)
bigrams <- data.frame(names = names(bigrams_freq), freq = bigrams_freq,
                      stringsAsFactors = FALSE)
bigrams$first <- sapply(strsplit(bigrams$names, "_"), "[[", 1)
bigrams$last  <- sapply(strsplit(bigrams$names, "_"), "[[", 2)
rownames(bigrams) <- NULL
bigrams.freq.freq <- table(bigrams$freq)
saveRDS(bigrams, "dictionaries/bigrams.rds")

# trigrams
trigrams <- dfm(myCorpus, ngrams = 3, verbose = TRUE, toLower = TRUE,
                removeNumbers = TRUE, removePunct = TRUE,
                removeSeparators = TRUE, removeTwitter = TRUE, stem = FALSE)
trigrams_freq <- sort(colSums(trigrams), decreasing = TRUE)
trigrams <- data.frame(names = names(trigrams_freq), freq = trigrams_freq,
                       stringsAsFactors = FALSE)
trigrams$first <- paste(sapply(strsplit(trigrams$names, "_"), "[[", 1),
                        sapply(strsplit(trigrams$names, "_"), "[[", 2),
                        sep = "_")
trigrams$last <- sapply(strsplit(trigrams$names, "_"), "[[", 3)
rownames(trigrams) <- NULL
saveRDS(trigrams, "dictionaries/trigrams.rds")
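
Note: the textfile(), toLower(), and ngrams-inside-dfm() calls above belong to the older quanteda interface. On later quanteda releases, the equivalent bigram count would look roughly like the following sketch (the names toks and bigram_dfm are illustrative, not from the original code):

library(quanteda)

# Rough equivalent under the newer quanteda API, where tokens() /
# tokens_ngrams() replaced the ngrams and remove* arguments to dfm().
toks <- tokens(myCorpus,
               remove_numbers = TRUE,
               remove_punct = TRUE,
               remove_separators = TRUE)
toks <- tokens_tolower(toks, keep_acronyms = TRUE)
bigram_dfm <- dfm(tokens_ngrams(toks, n = 2))
bigrams_freq <- sort(colSums(bigram_dfm), decreasing = TRUE)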

Answer 1:


After much headache, I kind of solved this myself in a very brute-force way, which I am slightly embarrassed about, but I will show it anyway! I am sure there are more elegant and efficient approaches (please feel free to educate me), but since I only need to process this text once, the inelegant solution matters less.

I converted the text to a 'tm' package VCorpus object, which consisted of three large text files, then iterated through the three files, manually slicing each one up and processing one chunk at a time. For clarity I have not plumbed in the processing code given above; I have only indicated where it needs to be stitched in. I still need to add some code to accumulate the results from each chunk; a sketch of that accumulation step follows the loop below.

library(tm)

folder.dataset.english <- 'final/corpus'
corpus <- VCorpus(DirSource(directory = folder.dataset.english,
                            encoding = "UTF-8", recursive = FALSE),
                  readerControl = list(language = "en"))
chunk.size <- 100000

for (t in seq_along(corpus)) {           # one document per text file
  corp.size <- length(corpus[[t]]$content)
  # slice the document into chunks of at most chunk.size lines;
  # min() keeps the last chunk inside the document even when
  # corp.size is an exact multiple of chunk.size
  for (l in seq(1, corp.size, by = chunk.size)) {
    h <- min(l + chunk.size - 1, corp.size)
    corpus.chunk <- corpus[[t]]$content[l:h]
    #### Processing code goes in here

    #### Processing code ends here
  }
}
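
To finish the job, the per-chunk results still have to be merged. A minimal sketch of that accumulation step, assuming each chunk is reduced to a named frequency vector the way colSums(bigrams) does in the question (total_freq, chunk_freq, and accumulate_freq are hypothetical names, not part of tm or quanteda):

# Running total of n-gram counts across all chunks
total_freq <- numeric(0)

# Hypothetical helper: add one chunk's named frequency vector to the total,
# summing counts for n-grams that appear in more than one chunk
accumulate_freq <- function(total, chunk_freq) {
  all_names <- union(names(total), names(chunk_freq))
  out <- setNames(numeric(length(all_names)), all_names)
  out[names(total)] <- total
  out[names(chunk_freq)] <- out[names(chunk_freq)] + chunk_freq
  out
}

# Inside the chunk loop, after building a dfm from corpus.chunk:
#   chunk_freq <- colSums(chunk_dfm)
#   total_freq <- accumulate_freq(total_freq, chunk_freq)

After the loop, total_freq plays the role that bigrams_freq (or trigrams_freq) plays in the question's code, and the same data.frame/strsplit post-processing can be applied to it.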


Source: https://stackoverflow.com/questions/41552345/working-with-large-text-files-in-r-to-create-n-grams
