R: tm Textmining package: Doc-Level metadata generation is slow

I have a list of documents to process, and for each record I want to attach some metadata to the document "member" inside the "corpus" data structure that tm, the R package, generates (from reading in text files).

This for-loop works but it is very slow, Performance seems to degrade as a function f ~ 1/n_docs.

for (i in seq(from= 1, to=length(corpus), by=1)){
    if(opts$options$verbose == TRUE || i %% 50 == 0){
        print(paste(i, " ", substr(corpus[[i]], 1, 140), sep = " "))
    }
    DublinCore(corpus[[i]], "title") = csv[[i,10]]  
    DublinCore(corpus[[i]], "Publisher" ) = csv[[i,16]]   #institutions
}

This may do something to the corpus variable but I don't know what. But when I put it inside a tm_map() (similar to lapply() function), it runs much faster, but the changes are not made persistent:

i = 0
corpus = tm_map(corpus, function(x){
            i <<- i + 1


    if(opts$options$verbose == TRUE){
        print(paste(i, " ", substr(x, 1, 140), sep = " "))
    }

    meta(x, tag = "Heading") = csv[[i,10]]  
    meta(x, tag = "publisher" ) = csv[[i,16]] 
})

Variable corpus has empty metadata fields after exiting the tm_map function. It should be filled. I have a few other things to do with the collection.

The R documentation for the meta() function says this:

     Examples:
      data("crude")
      meta(crude[[1]])
      DublinCore(crude[[1]])
      meta(crude[[1]], tag = "Topics")
      meta(crude[[1]], tag = "Comment") <- "A short comment."
      meta(crude[[1]], tag = "Topics") <- NULL
      DublinCore(crude[[1]], tag = "creator") <- "Ano Nymous"
      DublinCore(crude[[1]], tag = "Format") <- "XML"
      DublinCore(crude[[1]])
      meta(crude[[1]])
      meta(crude)
      meta(crude, type = "corpus")
      meta(crude, "labels") <- 21:40
      meta(crude)

I tried many of these calls (with var "corpus" instead of "crude"), but they do not seem to work. Someone else once seemed to have had the same problem with a similar data set (forum post from 2009, no response)

Here's a bit of benchmarking...

With the for loop :

expr.for <- function() {
  for (i in seq(from= 1, to=length(corpus), by=1)){
    DublinCore(corpus[[i]], "title") = LETTERS[round(runif(26))]
    DublinCore(corpus[[i]], "Publisher" ) = LETTERS[round(runif(26))]
  }
}

microbenchmark(expr.for())
# Unit: milliseconds
#         expr      min       lq   median       uq      max
# 1 expr.for() 21.50504 22.40111 23.56246 23.90446 70.12398

With tm_map :

corpus <- crude

expr.map <- function() {
  tm_map(corpus, function(x) {
    meta(x, "title") = LETTERS[round(runif(26))]
    meta(x, "Publisher" ) = LETTERS[round(runif(26))]
    x
  })
}

microbenchmark(expr.map())
# Unit: milliseconds
#         expr      min       lq   median       uq      max
# 1 expr.map() 5.575842 5.700616 5.796284 5.886589 8.753482

So the tm_map version, as you noticed, seems to be about 4 times faster.

In your question you say that the changes in the tm_map version are not persistent, it is because you don't return x at the end of your anonymous function. In the end it should be :

meta(x, tag = "Heading") = csv[[i,10]]  
meta(x, tag = "publisher" ) = csv[[i,16]] 
x

来源：https://stackoverflow.com/questions/14960863/r-tm-textmining-package-doc-level-metadata-generation-is-slow

标签

performance