Combining Multicore with Snow Cluster

Question


Fairly new to parallel R. Quick question. I have an algorithm that is computationally intensive. Fortunately, it can easily be broken up into pieces to make use of multicore or snow. What I would like to know is whether it is considered fine in practice to use multicore in conjunction with snow.

What I would like to do is split up my load to run on multiple machines in a cluster and, for each machine, utilize all of its cores. For this type of processing, is it reasonable to mix snow with multicore?
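
To make the idea concrete, here is a rough sketch of what I have in mind (the machine names and the workload split are placeholders; snow fans out across machines, and multicore's mclapply fans out across each machine's cores):

library(snow)       # distributes work across machines
library(multicore)  # distributes work across cores within one machine

# Placeholder machine names; assumes passwordless ssh is configured
cl <- makeCluster(c("machine1", "machine2"), type = "SOCK")

# Each machine receives one chunk and spreads it over all of its local cores
processChunk <- function(chunk) {
    multicore::mclapply(chunk, function(piece) {
        # ... computationally intensive work on one piece ...
        piece
    })
}

chunks <- split(1:100, rep(1:2, length.out = 100))  # placeholder workload split
results <- parLapply(cl, chunks, processChunk)
stopCluster(cl)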


Answer 1:


I have used the approach suggested above by lockedoff, that is, using the parallel package to distribute an embarrassingly parallel workload over multiple machines with multiple cores. First the workload is distributed over all machines, and then the workload of each machine is distributed over all its cores. The disadvantage of this approach is that there is no load balancing between machines (at least, I don't know how to do it).
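
A possible way to get load balancing between machines is parallel's clusterApplyLB, which dispatches one batch at a time to whichever worker is free, so faster machines end up taking more batches. A minimal sketch, using the same names as the code below:

library(parallel)
cluster = makeCluster(c("machine1", "etc"))  # same placeholder machine names as below
# clusterApplyLB hands out batches dynamically instead of pre-assigning them
batches = clusterApplyLB(cluster, batches, foo, otherArgumentsForFoo)
stopCluster(cluster)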

All loaded R code should be the same and in the same location on all machines (svn). Because initializing the clusters takes quite some time, the code below could be improved by reusing the created clusters.

foo <- function(workload, otherArgumentsForFoo) {
    # Load the shared code; the path must be the same on every machine
    source("/home/user/workspace/mycode.R")
    ...
}

distributedFooOnCores <- function(workload, otherArgumentsForFoo) {
    # Somehow assign a batch number to every record
    workload$ParBatchNumber = NA
    # Split the assigned workload into batches according to ParBatchNumber
    batches = by(workload, workload$ParBatchNumber, function(x) x)

    # Create a cluster with workers on all machines 
    library("parallel")
    cluster = makeCluster(detectCores(), outfile="distributedFooOnCores.log")
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
    stopCluster(cluster)

    # Merge the resulting batches into a single data frame
    results = do.call(rbind, batches)

    # Clean up: drop the helper batch-number column from the merged result
    results$ParBatchNumber = NULL
    return(invisible(results))
}
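
On a single machine, the same fan-out can also be done with mclapply, which forks the current R process instead of starting socket workers, so there is no cluster start-up cost (note that forking is not available on Windows). A minimal alternative sketch:

library(parallel)
# Forked workers inherit the current session, so no per-worker initialization is needed
batches = mclapply(batches, foo, otherArgumentsForFoo, mc.cores = detectCores())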

distributedFooOnMachines <- function(workload, otherArgumentsForFoo) {
    # Somehow assign a batch number to every record
    workload$DistrBatchNumber = NA
    # Split the assigned workload into batches according to DistrBatchNumber
    batches = by(workload, workload$DistrBatchNumber, function(x) x)

    # Create a cluster with workers on all machines 
    library("parallel")
    # If makeCluster hangs, please make sure passwordless ssh is configured on all machines
    cluster = makeCluster(c("machine1", "etc"), master="ub2", user="", outfile="distributedFooOnMachines.log")
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
    stopCluster(cluster)

    # Merge the resulting batches into a single data frame
    results = do.call(rbind, batches)

    # Clean up: drop the helper batch-number column from the merged result
    results$DistrBatchNumber = NULL
    return(invisible(results))
}
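
The two levels can be nested as described at the top of this answer: the machine-level parLapply hands each machine one batch, and a wrapper on that machine pushes the batch down to all local cores. A sketch of such a wrapper (fooOnMachine is a hypothetical name; distributedFooOnMachines would then pass fooOnMachine instead of foo to parLapply):

# Runs on one machine and spreads that machine's batch over its cores
fooOnMachine <- function(workload, otherArgumentsForFoo) {
    # mycode.R must define distributedFooOnCores and foo at the same path on every machine
    source("/home/user/workspace/mycode.R")
    distributedFooOnCores(workload, otherArgumentsForFoo)
}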

I'm interested in how the approach above can be improved.
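
For example, along the lines of the reuse idea mentioned earlier, the cluster could be created once and passed in, instead of being created and stopped inside every call. A rough sketch of that refactor (workloadA and workloadB are hypothetical workloads):

# Refactored to accept an existing cluster instead of creating its own
distributedFooOnMachines <- function(cluster, workload, otherArgumentsForFoo) {
    # Somehow assign a batch number to every record, as before
    workload$DistrBatchNumber = NA
    batches = by(workload, workload$DistrBatchNumber, function(x) x)
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
    results = do.call(rbind, batches)
    results$DistrBatchNumber = NULL
    return(invisible(results))
}

library(parallel)
cluster = makeCluster(c("machine1", "etc"), master="ub2", user="", outfile="reused.log")
resultsA = distributedFooOnMachines(cluster, workloadA, otherArgumentsForFoo)
resultsB = distributedFooOnMachines(cluster, workloadB, otherArgumentsForFoo)
stopCluster(cluster)  # start up once, tear down once, after the last call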



Source: https://stackoverflow.com/questions/11624460/combining-multicore-with-snow-cluster
