parallel parLapply setup

無奈伤痛 2020-12-10 14:43

I am trying to use part-of-speech tagging from the openNLP/NLP packages in parallel. I need the code to work on any OS, so am opting to use the parLapply function from the parallel package.

1 Answer
  • 2020-12-10 15:11

    Since you're calling functions from NLP and openNLP on the cluster workers, you should load those packages on each of the workers before calling parLapply. You can do that from the worker function, but I tend to use clusterCall or clusterEvalQ right after creating the cluster object:

    clusterEvalQ(cl, {library(openNLP); library(NLP)})
    
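    The clusterCall equivalent (shown here only to illustrate the alternative mentioned above; it is not code from the original answer) would be:

    clusterCall(cl, function() {
        library(openNLP)
        library(NLP)
        NULL  # return NULL so the attach results aren't shipped back to the master
    })
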

    Since as.String and Maxent_Word_Token_Annotator are defined in those packages, they shouldn't be exported with clusterExport.
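
    For contrast, an object defined in your own workspace would need to be exported explicitly. A minimal sketch, assuming a hypothetical lookup table my_lookup created on the master (not part of the original example):

    my_lookup <- c(NN = "noun", VB = "verb")  # hypothetical object defined on the master
    clusterExport(cl, "my_lookup")            # copy it to each worker's global environment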

    Note that while running your example on my machine, I noticed that the PTA object doesn't work after being exported to the worker machines. Presumably there is something in that object that can't be safely serialized and unserialized. After I created that object on the workers using clusterEvalQ, the example ran successfully. Here it is, using openNLP 0.2-1:

    library(parallel)
    # tag each word of a string with its part of speech;
    # relies on a PTA annotator object existing in the global environment
    tagPOS <- function(x, ...) {
        s <- as.String(x)
        word_token_annotator <- Maxent_Word_Token_Annotator()
        a2 <- Annotation(1L, "sentence", 1L, nchar(s))
        a2 <- annotate(s, word_token_annotator, a2)
        a3 <- annotate(s, PTA, a2)
        a3w <- a3[a3$type == "word"]
        POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
        POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
        list(POStagged = POStagged, POStags = POStags)
    }
    text.var <- c("I like it.", "This is outstanding soup!",
        "I really must get the recipe.")
    # create a cluster, here using half the detected cores by default
    cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
    # load the packages and build the POS tag annotator on each worker,
    # since the PTA object doesn't survive being serialized to the workers
    clusterEvalQ(cl, {
        library(openNLP)
        library(NLP)
        PTA <- Maxent_POS_Tag_Annotator()
    })
    m <- parLapply(cl, text.var, tagPOS)
    print(m)
    stopCluster(cl)
    

    If clusterEvalQ fails because Maxent_POS_Tag_Annotator is not found, you might be loading the wrong version of openNLP on the workers. You can determine what package versions you're getting on the workers by executing sessionInfo with clusterEvalQ:

    library(parallel)
    cl <- makeCluster(2)
    clusterEvalQ(cl, {library(openNLP); library(NLP)})
    clusterEvalQ(cl, sessionInfo())
    

    This will return the results of executing sessionInfo() on each of the cluster workers. Here is the version information for some of the packages that I'm using and that work for me:

    other attached packages:
    [1] NLP_0.1-0     openNLP_0.2-1
    
    loaded via a namespace (and not attached):
    [1] openNLPdata_1.5.3-1 rJava_0.9-4
    
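    If the worker versions differ from what you see on the master, one possible cause (an assumption on my part, not something covered above) is that the workers resolve a different library search path. You can compare the paths like this:

    .libPaths()                    # library paths on the master
    clusterEvalQ(cl, .libPaths())  # library paths on each worker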