My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier.
I think you may want to consider a more regex-focused solution. These are some of the problems I'm wrestling with as a developer. I'm currently looking heavily at the stringi package for development, as it has some consistently named functions that are wicked fast for string manipulation.

In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr; instead I focus on string manipulation with stringi, keeping the data in a matrix, and manipulating it with packages meant to handle that format. I take your example and multiply it 100,000x. Even with stemming, this takes 17 seconds on my machine.
data <- data.frame(
    text = c("Let the big dogs hunt",
             "No holds barred",
             "My child is an honor student"),
    stringsAsFactors = FALSE
)

## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop = FALSE]
library(stringi)
library(SnowballC)
## stem, lowercase, and split each document into a vector of words
out <- stri_extract_all_words(stri_trans_tolower(SnowballC::wordStem(data[[1]], "english")))  # in old stringi versions this was named 'stri_extract_words'
names(out) <- paste0("doc", 1:length(out))

## vocabulary: every unique word across all documents
lev <- sort(unique(unlist(out)))

## count occurrences of each term in each document (terms as rows, documents as columns)
dat <- do.call(cbind, lapply(out, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- lev
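## (optional) sanity check: peek at a few counts before the tm conversion --
## on the replicated toy data the same small counts simply repeat across columns
dat[1:5, 1:3]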
library(tm)

## drop English stopwords before building the sparse representation
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ]

library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)

tdm <- tm::as.TermDocumentMatrix(dat2, weighting = weightTf)
tdm
## ...or, transposed, as a document-term matrix
dtm <- tm::as.DocumentTermMatrix(dat2, weighting = weightTf)
dtm
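From here dtm is an ordinary tm DocumentTermMatrix, so you can inspect it and hand it to whatever classifier interface you're using. A minimal sketch of that handoff, assuming e1071's naiveBayes and made-up labels (note that naiveBayes wants a dense matrix, which won't be practical at 4M rows without sampling or a sparse-aware classifier):

dim(dtm)                    # documents x terms
tm::inspect(dtm[1:5, 1:5])  # peek at a corner of the matrix

## hypothetical handoff -- the labels are placeholders, not part of your data
# library(e1071)
# labels <- factor(sample(c("pos", "neg"), nrow(dtm), replace = TRUE))
# fit <- e1071::naiveBayes(as.matrix(dtm), labels)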