R: tm Textmining package: Doc-Level metadata generation is slow

纵饮孤独 提交于 2019-12-06 10:51:45

Here's a bit of benchmarking...

With the for loop :

expr.for <- function() {
  for (i in seq(from= 1, to=length(corpus), by=1)){
    DublinCore(corpus[[i]], "title") = LETTERS[round(runif(26))]
    DublinCore(corpus[[i]], "Publisher" ) = LETTERS[round(runif(26))]
  }
}

microbenchmark(expr.for())
# Unit: milliseconds
#         expr      min       lq   median       uq      max
# 1 expr.for() 21.50504 22.40111 23.56246 23.90446 70.12398

With tm_map :

corpus <- crude

expr.map <- function() {
  tm_map(corpus, function(x) {
    meta(x, "title") = LETTERS[round(runif(26))]
    meta(x, "Publisher" ) = LETTERS[round(runif(26))]
    x
  })
}

microbenchmark(expr.map())
# Unit: milliseconds
#         expr      min       lq   median       uq      max
# 1 expr.map() 5.575842 5.700616 5.796284 5.886589 8.753482

So the tm_map version, as you noticed, seems to be about 4 times faster.

In your question you say that the changes in the tm_map version are not persistent, it is because you don't return x at the end of your anonymous function. In the end it should be :

meta(x, tag = "Heading") = csv[[i,10]]  
meta(x, tag = "publisher" ) = csv[[i,16]] 
x
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!