Remove stopwords and tolower function slow on a Corpus in R

Submitted by 纵饮孤独 on 2019-12-08 06:37:36

Question


I have corpus roughly with 75 MB data. I am trying to use the following command

tm_map(doc.corpus, removeWords, stopwords("english"))
tm_map(doc.corpus, tolower)

These two calls alone take at least 40 minutes to run. I am looking to speed up the process, since I build a term-document matrix (TDM) from this corpus for my model.

I have tried calling gc() and memory.limit(10000000) frequently, but I have not been able to make the process any faster.

I have a system with 4 GB of RAM and read the input data from a local database.

Hoping for suggestions to speed up!


Answer 1:


Maybe you can give quanteda a try:

library(stringi)
library(tm)
library(quanteda)

txt <- stri_rand_lipsum(100000L)
print(object.size(txt), units = "Mb")
# 63.4 Mb

system.time(
  dfm <- dfm(txt, toLower = TRUE, ignoredFeatures = stopwords("en"))  # pre-1.0 quanteda API
)
# Elapsed time: 12.3 seconds.
#        User      System verstrichen   ("verstrichen" = elapsed; German locale)
#       11.61        0.36       12.30 

system.time(
  dtm <- DocumentTermMatrix(
    Corpus(VectorSource(txt)), 
    control = list(tolower = TRUE, stopwords = stopwords("en"))
  )
)
#  User      System verstrichen 
# 157.16        0.38      158.69 
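The `toLower` and `ignoredFeatures` arguments above come from an old quanteda release and were later removed. A minimal sketch of the same pipeline with the tokens-based API, assuming quanteda version 2 or later (the sample texts are illustrative, not from the question):

```r
library(quanteda)

txt_small <- c("The Quick Brown Fox", "Jumps Over the Lazy Dog")

# Tokenize, lowercase, and drop English stopwords, then build the dfm.
toks <- tokens(txt_small, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))
mat <- dfm(toks)

featnames(mat)  # remaining features: lowercased, stopwords removed
```

Working on tokens objects keeps each step explicit and vectorized, which is where the speedup over tm's per-document `tm_map` loop comes from.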



Answer 2:


First, I'd try:

doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))

because tolower() is not among the transformations listed by getTransformations(); wrapping it in content_transformer() applies it to the document content while keeping the corpus structure intact.
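The wrapped call can be checked on a small corpus; a minimal sketch (the sample texts are illustrative):

```r
library(tm)

corp <- VCorpus(VectorSource(c("The Quick Brown Fox",
                               "Jumps Over the Lazy Dog")))

# content_transformer() wraps a plain character function so that tm_map
# returns proper text documents rather than bare character vectors.
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))

content(corp[[1]])  # lowercased, stopwords stripped (whitespace gaps remain)
```

Note that removeWords leaves the removed words' surrounding whitespace behind, so a stripWhitespace pass is often added afterwards.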



Source: https://stackoverflow.com/questions/38377483/remove-stopwords-and-tolower-function-slow-on-a-corpus-in-r
