Writing to the global environment when running in parallel


The pattern of a central store that workers consult for values is implemented in the rredis package on CRAN. The idea is that a Redis server maintains a store of key-value pairs (your global data frame, re-implemented). Workers query the server to see whether a value has already been calculated (redisGet) and, if not, do the calculation and store it (redisSet) so that other workers can re-use it. Workers can be plain R scripts, so it's easy to expand the workforce. It's a very nice alternative parallel paradigm.

Here's an example that uses the notion of 'memoizing' each result. We have a function that is slow (it sleeps for a second)

fun <- function(x) { Sys.sleep(1); x }

We write a 'memoizer' that returns a variant of fun which first checks whether the value for x has already been calculated and, if so, uses that instead of recomputing it

memoize <-
    function(FUN)
{
    force(FUN) # circumvent lazy evaluation
    require(rredis)
    redisConnect()
    function(x)
    {
        key <- as.character(x)
        val <- redisGet(key)   # look up a previously cached result
        if (is.null(val)) {
            val <- FUN(x)      # not cached: compute it...
            redisSet(key, val) # ...and store it for other workers
        }
        val
    }
}

We then memoize our function

funmem <- memoize(fun)

and go

> system.time(res <- funmem(10)); res
   user  system elapsed 
  0.003   0.000   1.082 
[1] 10
> system.time(res <- funmem(10)); res
   user  system elapsed 
  0.001   0.001   0.040 
[1] 10

This does require a Redis server running outside R, but one is very easy to install; see the documentation that comes with the rredis package.
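If you want to sanity-check the server from R before wiring up any workers, something like the following should do (a minimal sketch; it assumes Redis is listening on its default localhost:6379):

library(rredis)
redisConnect()            # defaults to host "localhost", port 6379
redisSet("ping", "pong")  # store a test value
redisGet("ping")          # should return "pong"
redisClose()              # close the connection again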

A within-R parallel version might be

library(snow)
cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")
clusterEvalQ(cl, { require(rredis); redisConnect() }) # each worker gets its own connection
tasks <- sample(1:5, 100, TRUE)                       # 100 tasks, only 5 distinct inputs
system.time(res <- parSapply(cl, tasks, funmem))
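Because the memoized values live in the Redis store rather than on any single worker, a second pass over the same tasks should come back almost immediately (a sketch continuing the session above; the first pass has already cached all five distinct inputs):

# every key is already in Redis, so no worker calls the slow function
system.time(res2 <- parSapply(cl, tasks, funmem))
stopCluster(cl)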

It will depend on what the function in question is, of course, but I'm afraid snowfall won't be much of a help there. The thing is, you'll have to export the whole data frame to the different cores (see ?sfExport) and still find a way to combine the results afterwards. That kind of defeats the whole purpose of changing the value in the global environment, as you probably want to keep memory use as low as possible.

You can dive into the low-level functions of snow to kind of get this to work. See the following example:

# Some data
Data <- data.frame(
  cell = 1:10,
  value = sample(c(100, NA), 10, TRUE),
  x = 1:10,
  y = 1:10
)
# A sample function
sample.func <- function(){
    id <- which(is.na(Data$value)) # get the NA values

    # this splits up the values from the dataframe in a list
    # which will be passed to clusterApply later on.
    parts <- lapply(clusterSplit(cl, id), function(i) Data[i, c("x", "y")])

    # here happens the magic: <<- assigns the results into the
    # data frame in the global environment of the master process
    Data$value[id] <<- unlist(
        clusterApply(cl, parts, function(x) x$x + x$y)
    )
}
# now we run it
require(snow)
cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")
sample.func()
stopCluster(cl)
> Data
   cell value  x  y
1     1   100  1  1
2     2   100  2  2
3     3     6  3  3
4     4     8  4  4
5     5    10  5  5
6     6    12  6  6
7     7   100  7  7
8     8   100  8  8
9     9    18  9  9
10   10    20 10 10

You will still have to copy (part of) your data to get it to the cores, though. But that happens anyway when you call the snowfall high-level functions on data frames, as snowfall uses the low-level functions of snow under the hood.

Plus, one shouldn't forget that if you change a single value in a data frame, the whole data frame gets copied in memory as well. So you won't win much by adding the values one by one as they come back from the cluster. You might want to try a few different approaches and do some memory profiling too.
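To actually see that copying happen, base R's tracemem() is a cheap way to check (a minimal sketch; tracemem() needs an R build with memory profiling enabled, which the standard CRAN binaries have):

df <- data.frame(value = rep(NA_real_, 10), x = 1:10, y = 1:10)
tracemem(df)       # print a message every time R duplicates df
df$value[1] <- 2   # replacing a single element still copies the data frame
untracemem(df)     # stop tracing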

I agree with Joris that you will need to copy your data to the other cores. On the positive side, you don't have to worry, within the cores, about whether there are NAs in the data or not. If your original data.frame is called cont.values:

nnaidx <- is.na(cont.values$value)  # where the data is missing originally
dfrnna <- cont.values[nnaidx, ]     # subset for copying to the other cores
calcValForDfrRow <- function(dfrRow) { dfrRow$x + dfrRow$y } # or whatever pleases you
sfExport("dfrnna", "calcValForDfrRow") # export what is needed; sfExport takes object names as strings
# sfSapply handles 'reordering', so this works exactly as if you had called sapply
cont.values$value[nnaidx] <- sfSapply(seq_len(nrow(dfrnna)),
                                      function(i) calcValForDfrRow(dfrnna[i, ]))

This should work nicely (barring typos).
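For completeness, the snippet above assumes an already-running snowfall session; a minimal wrapper might look like this (the cluster size and the example cont.values are placeholders, not part of the answer itself):

library(snowfall)
sfInit(parallel = TRUE, cpus = 2) # start a local cluster

# hypothetical data in the shape the snippet expects
cont.values <- data.frame(x = 1:10, y = 1:10,
                          value = sample(c(100, NA), 10, TRUE))

# ... run the snippet above ...

sfStop() # shut the workers down again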
