More efficient means of creating a corpus and DTM with 4M rows

予麋鹿 2020-12-12 21:18

My file has over 4M rows, and I need a more efficient way to convert my data into a corpus and document-term matrix (DTM) so that I can pass it to a Bayesian classifier.

4 answers
  •  抹茶落季
    2020-12-12 21:43

    I think you may want to consider a more regex-focused solution. These are some of the problems I'm wrestling with myself as a developer. I'm currently looking heavily at the stringi package, as it has consistently named functions that are wicked fast for string manipulation.
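
    As a rough illustration of what "regex-focused" cleanup with stringi can look like, here is a minimal sketch. The sample strings and the variable names `x`, `clean`, and `tokens` are made up for this example and are not part of the question's data; the pattern assumes you only want to keep letters.

    ```r
    library(stringi)

    # Hypothetical sample input, not from the original question
    x <- c("Let the BIG dogs hunt!!", "No holds barred (2020)")

    # Lowercase, then collapse every run of non-letter characters into one space
    clean <- stri_trans_tolower(x)
    clean <- stri_replace_all_regex(clean, "[^\\p{L}]+", " ")
    clean <- stri_trim_both(clean)

    # Split on whitespace; the result is a list with one character vector per document
    tokens <- stri_split_regex(clean, "\\s+")
    ```

    All of these functions are vectorised over the input, so the same calls should work unchanged on a column with millions of rows.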

    In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). I haven't even explored parallel processing or data.table/dplyr here. Instead, I focus on string manipulation with stringi, keep the data in a matrix, and manipulate it with packages designed to handle that format. I take your example and replicate it 100,000 times. Even with stemming, this takes 17 seconds on my machine.

    data <- data.frame(
        text=c("Let the big dogs hunt",
            "No holds barred",
            "My child is an honor student"
        ), stringsAsFactors = FALSE)
    
    ## skip this step to run the code as a minimal working example (MWE);
    ## it replicates the three rows 100000 times
    data <- data[rep(1:nrow(data), 100000), , drop=FALSE]
    
    library(stringi)
    library(SnowballC)
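    ## tokenize and lowercase first, then stem word by word: wordStem()
    ## expects single words, not whole sentences
    ## (in old stringi versions the function was named 'stri_extract_words')
    out <- stri_extract_all_words(stri_trans_tolower(data[[1]]))
    out <- lapply(out, SnowballC::wordStem, language = "english")
    names(out) <- paste0("doc", seq_along(out))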
    
    lev <- sort(unique(unlist(out)))
    dat <- do.call(cbind, lapply(out, function(x, lev) {
        tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
    }, lev = lev))
    rownames(dat) <- lev  # lev is already sorted
    
    library(tm)
    dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ] 
    
    library(slam)
    dat2 <- slam::as.simple_triplet_matrix(dat)
    
    tdm <- tm::as.TermDocumentMatrix(dat2, weighting=weightTf)
    tdm
    
    ## or...
    dtm <- tm::as.DocumentTermMatrix(dat2, weighting=weightTf)
    dtm
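
    One caveat with the code above is that `do.call(cbind, ...)` builds a dense terms-by-documents matrix before converting it, which may become the bottleneck once the vocabulary grows. As a hedged alternative (my own sketch, not a benchmarked replacement), the sparse triplets can be assembled directly from the token lists. The input `docs` and the helper names `nterm`, `key`, `runs`, and `stm` are hypothetical, and stemming and stopword removal are left out to keep it short.

    ```r
    library(stringi)
    library(slam)

    # Hypothetical sample input; the last document repeats words to show counting
    docs <- c("Let the big dogs hunt",
              "No holds barred",
              "My child is an honor student",
              "Big dogs chase big dogs")

    # omit_no_match = TRUE gives character(0) instead of NA for empty documents
    tokens <- stri_extract_all_words(stri_trans_tolower(docs), omit_no_match = TRUE)
    lev <- sort(unique(unlist(tokens)))
    nterm <- length(lev)

    # One (term index, document index) pair per token occurrence
    i <- match(unlist(tokens), lev)
    j <- rep(seq_along(tokens), lengths(tokens))

    # Encode each pair as a single number (double, to avoid integer overflow),
    # then count repeated pairs with rle() on the sorted keys
    key <- (j - 1) * nterm + i
    runs <- rle(sort(key))

    # Decode the keys back into row/column indices; run lengths are the counts
    stm <- simple_triplet_matrix(
        i = as.integer((runs$values - 1) %% nterm + 1),
        j = as.integer((runs$values - 1) %/% nterm + 1),
        v = runs$lengths,
        nrow = nterm, ncol = length(tokens),
        dimnames = list(lev, paste0("doc", seq_along(tokens)))
    )
    ```

    The resulting `stm` has terms in rows and documents in columns, so it should be usable in place of `dat2` in the `tm::as.TermDocumentMatrix()` and `tm::as.DocumentTermMatrix()` calls above.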
    
