Really fast word ngram vectorization in R
edit: The new package text2vec is excellent, and solves this problem (and many others) really well.

- text2vec on CRAN
- text2vec on github
- vignette that illustrates ngram tokenization

I have a pretty large text dataset in R, which I've imported as a character vector:

```r
# Takes about 15 seconds
system.time({
  set.seed(1)
  samplefun <- function(n, x, collapse) {
    paste(sample(x, n, replace = TRUE), collapse = collapse)
  }
  words <- sapply(rpois(10000, 3) + 1, samplefun, letters, '')
  sents1 <- sapply(rpois(1000000, 5) + 1, samplefun, words, ' ')
})
```

I can convert this character data to a bag-of-words
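As a side note on the edit above, the n-gram vectorization workflow described in the text2vec vignette looks roughly like the sketch below, applied to the `sents1` vector generated earlier. This is a sketch rather than a definitive recipe: `itoken()`, `create_vocabulary()`, `vocab_vectorizer()`, and `create_dtm()` are text2vec functions, but argument names and defaults have shifted between package versions.

```r
library(text2vec)

# Stream over the documents, splitting each sentence into word tokens
it <- itoken(sents1, tokenizer = word_tokenizer, progressbar = FALSE)

# Build a vocabulary of unigrams and bigrams
vocab <- create_vocabulary(it, ngram = c(1L, 2L))

# Map the vocabulary to a vectorizer and build a sparse document-term matrix
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)

dim(dtm)  # documents x (unigram + bigram) features
```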