foreach %dopar% + RPostgreSQL

囚心锁ツ 2020-12-05 08:37

I am using RPostgreSQL to connect to a local database. The setup works just fine on my Linux machine. R 2.11.1, Postgres 8.4.

I was playing with the 'foreach' with

2 Answers
  • 2020-12-05 09:23

    It's more efficient to create the database connection once per worker, rather than once per task. Unfortunately, mclapply doesn't provide a mechanism for initializing the workers before executing tasks, so it's not easy to do this using the doMC backend, but if you use the doParallel backend, you can initialize the workers using clusterEvalQ. Here's an example of how to restructure the code:

    library(doParallel)
    cl <- makePSOCKcluster(detectCores())
    registerDoParallel(cl)
    
    clusterEvalQ(cl, {
      library(DBI)
      library(RPostgreSQL)
      drv <- dbDriver("PostgreSQL")
      con <- dbConnect(drv, dbname="nsdq")
      NULL
    })
    
    id.qed.foreach <- foreach(i=1588:3638, .inorder=FALSE,
                              .noexport="con",
                              .packages=c("DBI", "RPostgreSQL")) %dopar% {
      lst <- eval(expr.01)  #contains the SQL query which depends on 'i'
      qry <- dbSendQuery(con, lst)
      tmp <- fetch(qry, n=-1)
      dbClearResult(qry)  # release the result set before the next task reuses con
      dt <- dates.qed2[i]
      list(date=dt, idreuters=tmp$idreuters)
    }
    
    clusterEvalQ(cl, {
      dbDisconnect(con)
    })
    stopCluster(cl)  # shut down the workers once their connections are closed
    

    Since doParallel and clusterEvalQ are using the same cluster object cl, the foreach loop will have access to the database connection object con when executing the tasks.
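    To see this without a database, here is a minimal sketch that uses only the parallel package. The object fake_con is a hypothetical stand-in for the real connection, and the worker count of 2 is arbitrary. It initializes each worker once with clusterEvalQ, then runs several tasks that read the worker-local object instead of creating their own.

```r
library(parallel)

# Two local worker processes; the count is arbitrary for this sketch.
cl <- makePSOCKcluster(2)

# Runs once on every worker. `fake_con` is a hypothetical stand-in for
# a database connection; it lives in the worker's global environment.
invisible(clusterEvalQ(cl, {
  fake_con <- list(pid = Sys.getpid())
  NULL
}))

# Each task looks up the worker-local object rather than opening its own.
task <- function(i) {
  con <- get("fake_con", envir = globalenv())
  list(i = i, pid = con$pid)
}
res <- clusterApply(cl, 1:8, task)

# Distinct worker processes that served the eight tasks.
worker_pids <- unique(vapply(res, function(r) r$pid, integer(1)))

stopCluster(cl)
```

    Eight tasks run, but only as many stand-in connections exist as there are workers, and none of them belongs to the master process.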

  • 2020-12-05 09:26

    The following works and is about 1.5x faster than the sequential version. As a next step, I am wondering whether it is possible to attach a connection object to each of the workers spawned by registerDoMC. If so, there would be no need to create and destroy a connection for every task, which would avoid overwhelming the PostgreSQL server with connections.

    pgparquery <- function(i) {
      drv <- dbDriver("PostgreSQL")
      con <- dbConnect(drv, dbname = "nsdq")
      lst <- eval(expr.01)  # contains the SQL query, which depends on 'i'
      qry <- dbSendQuery(con, lst)
      tmp <- fetch(qry, n = -1)
      dt <- dates.qed2[i]
      dbDisconnect(con)
      result <- list(date = dt, idreuters = tmp$idreuters)
      return(result)
    }

    id.qed.foreach <- foreach(i = 1588:3638, .inorder = FALSE,
                              .packages = c("DBI", "RPostgreSQL")) %dopar% {
      pgparquery(i)
    }
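    One possible direction for the "one connection per worker" idea is a lazily created, per-process cached connection. The sketch below shows only the caching pattern. open_connection is a hypothetical stand-in for dbConnect(drv, dbname = "nsdq"), and the names .cache and get_connection are made up for illustration. Whether a cached connection is actually reused across tasks depends on how the backend schedules tasks onto processes, so treat this as an assumption to verify rather than a guarantee.

```r
# Per-process cache for the connection (hypothetical helper names).
.cache <- new.env(parent = emptyenv())
opened <- 0L  # counts how many times a "connection" was really opened

# Hypothetical stand-in for dbConnect(dbDriver("PostgreSQL"), dbname = "nsdq").
open_connection <- function() {
  opened <<- opened + 1L
  list(pid = Sys.getpid())
}

# Open the connection on first use, then hand back the cached one.
get_connection <- function() {
  if (is.null(.cache$con)) {
    .cache$con <- open_connection()
  }
  .cache$con
}

# Five tasks in the same process share a single connection.
results <- lapply(1:5, function(i) {
  con <- get_connection()
  list(i = i, pid = con$pid)
})
```

    In the loop body, pgparquery would then call get_connection() instead of dbConnect and skip the per-task dbDisconnect.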
    

    --
    Vishal Belsare
