snowfall

Communication of parallel processes: what are my options?

守給你的承諾、 Submitted on 2019-12-03 09:06:27
I'm trying to dig a bit deeper into parallelization of R routines. What are my options with respect to communication among a bunch of "worker" processes, regarding both the communication between the respective workers and the communication of the workers with the "master" process? AFAIU, there's no such thing as a "shared environment/shared memory" that both the master and all worker processes have access to, right? The best idea I came up with so far is to base the communication on reading and writing JSON documents to the hard drive. That's probably a bad idea ;-) I chose .json over…
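
For context, the usual pattern with snow/parallel clusters is message passing rather than shared memory: the master ships objects to the workers (e.g. with clusterExport) and gets results back as the return value of parLapply, so no disk-based JSON exchange is needed for simple cases. A minimal sketch using the parallel package; the object names are made up for illustration:

    library(parallel)

    cl <- makeCluster(4)

    # Master -> workers: copy an object from the master's environment
    shared_input <- data.frame(id = 1:10, x = rnorm(10))
    clusterExport(cl, "shared_input")

    # Workers -> master: results come back as the return value of parLapply;
    # workers cannot write into the master's global environment directly
    results <- parLapply(cl, 1:10, function(i) shared_input$x[i] * 2)

    stopCluster(cl)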

Fast correlation in R using C and parallelization

泄露秘密 Submitted on 2019-12-03 03:40:19
My project for today was to write a fast correlation routine in R using the basic skill set I have. I have to find the correlation between almost 400 variables, each having almost a million observations (i.e. a matrix of size p = 1MM rows and n = 400 cols). R's native correlation function takes almost 2 minutes for 1MM rows and 200 observations per variable. I have not run it for 400 observations per column, but my guess is it will take almost 8 minutes. I have less than 30 seconds to finish it. Hence, I want to do two things: 1 - write a simple correlation function in C and apply it in blocks in parallel (see below…
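
For reference, one common way to speed up a full correlation matrix without leaving R is to compute it as a crossproduct of the column-standardized matrix, which hands the heavy lifting to BLAS (and to however many threads BLAS uses). A minimal sketch, not the C-based approach the question pursues, with the dimensions shrunk for illustration:

    set.seed(1)
    n_rows <- 10000   # the question uses ~1MM rows
    n_cols <- 400
    X <- matrix(rnorm(n_rows * n_cols), nrow = n_rows)

    # cor(X) equals crossprod(scale(X)) / (n_rows - 1): one BLAS call
    # instead of R's element-wise bookkeeping in cor()
    Xs <- scale(X, center = TRUE, scale = TRUE)
    cor_fast <- crossprod(Xs) / (n_rows - 1)

    all.equal(cor_fast, cor(X), check.attributes = FALSE)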

How to manage parallel processing with animated ggplot2-plot?

ⅰ亾dé卋堺 Submitted on 2019-12-01 23:43:51
I'm trying to build an animated bar plot with ggplot2 and magick that grows on a day-by-day basis. Unfortunately, I've got tens of thousands of entries in my dataset (dates for each day over several years and different categories), which makes processing very slow. Thus, I'm using the snow package to speed up processing time. However, I ran into trouble when splitting my data and calling ggplot() in a cluster: magick requires splitting the data per date for animation, and snow requires splitting per cluster node for parallel processing. So I'm getting a list of lists, which causes problems when…
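
For illustration, one way to avoid the list-of-lists problem is to let each worker render its frames straight to PNG files and return only the file paths, so the master reads a flat vector of files back into magick. A minimal sketch with made-up data; the packages are snow, ggplot2 and magick as in the question:

    library(snow)
    library(ggplot2)
    library(magick)

    # Made-up data: one row per date and category
    dat <- expand.grid(date = seq(as.Date("2018-01-01"), by = "day", length.out = 30),
                       category = c("A", "B"))
    dat$value <- runif(nrow(dat))
    dates <- sort(unique(dat$date))

    out_dir <- tempfile("frames"); dir.create(out_dir)

    cl <- makeCluster(2, type = "SOCK")
    clusterEvalQ(cl, library(ggplot2))
    clusterExport(cl, c("dat", "dates", "out_dir"))

    # Each worker renders the frames for its share of the dates to disk
    frame_files <- unlist(parLapply(cl, seq_along(dates), function(k) {
      d <- dates[k]
      agg <- aggregate(value ~ category, data = dat[dat$date <= d, ], FUN = sum)
      p <- ggplot(agg, aes(category, value)) + geom_col() + ggtitle(format(d))
      f <- file.path(out_dir, paste0("frame_", format(d), ".png"))
      ggsave(f, p, width = 4, height = 3, dpi = 100)
      f
    }))
    stopCluster(cl)

    # Read the frames back on the master and animate them with magick
    image_write(image_animate(image_read(frame_files), fps = 5), "barplot.gif")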

How to calculate number of occurrences per minute for a large dataset

萝らか妹 Submitted on 2019-11-30 21:40:20
I have a dataset with 500k appointments lasting between 5 and 60 minutes. tdata <- structure(list(Start = structure(c(1325493000, 1325493600, 1325494200, 1325494800, 1325494800, 1325495400, 1325495400, 1325496000, 1325496000, 1325496600, 1325496600, 1325497500, 1325497500, 1325498100, 1325498100, 1325498400, 1325498700, 1325498700, 1325499000, 1325499300), class = c("POSIXct", "POSIXt"), tzone = "GMT"), End = structure(c(1325493600, 1325494200, 1325494500, 1325495400, 1325495400, 1325496000, 1325496000, 1325496600, 1325496600, 1325496900, 1325496900, 1325498100, 1325498100, 1325498400,
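
One straightforward approach is to expand every appointment into the minutes it covers and then tabulate. A minimal sketch on a tiny made-up data frame (the real question works on the truncated tdata above):

    # Tiny illustrative data: two overlapping appointments
    appts <- data.frame(
      Start = as.POSIXct(c("2012-01-02 09:10:00", "2012-01-02 09:12:00"), tz = "GMT"),
      End   = as.POSIXct(c("2012-01-02 09:14:00", "2012-01-02 09:13:00"), tz = "GMT")
    )

    # Expand each appointment into the minutes it covers (endpoints inclusive here)
    minutes <- unlist(lapply(seq_len(nrow(appts)),
                             function(i) seq(appts$Start[i], appts$End[i], by = "min")))
    minutes <- as.POSIXct(minutes, origin = "1970-01-01", tz = "GMT")

    # Count how many appointments are running in each minute
    occupancy <- as.data.frame(table(format(minutes, "%Y-%m-%d %H:%M")))
    names(occupancy) <- c("minute", "appointments")
    occupancy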

writing to global environment when running in parallel

徘徊边缘 Submitted on 2019-11-30 14:24:23
I have a data.frame of cells, values and coordinates. It resides in the global environment.

    > head(cont.values)
       cell value   x   y
    1 11117    NA -34 322
    2 11118    NA -30 322
    3 11119    NA -26 322
    4 11120    NA -22 322
    5 11121    NA -18 322
    6 11122    NA -14 322

Because my custom function takes almost a second to calculate an individual cell (and I have tens of thousands of cells to calculate), I don't want to duplicate calculations for cells that already have a value. My following solution tries to avoid that. Each cell can be calculated independently, screaming for parallel execution. What my function actually does…
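
For context, a common pattern is to have the workers return their results and let the master write them into the global data.frame afterwards, because assignments made on a worker (even with <<- or assign()) only touch the worker's own copy of the environment. A minimal sketch with a made-up stand-in for the slow per-cell function:

    library(parallel)

    # Small stand-in for the question's data and the expensive per-cell function
    cont.values <- data.frame(cell = 11117:11122, value = NA_real_,
                              x = seq(-34, -14, by = 4), y = 322)
    calc_cell <- function(x, y) { Sys.sleep(0.1); x + y }

    cl <- makeCluster(4)
    clusterExport(cl, c("cont.values", "calc_cell"))

    # Only compute the cells that do not have a value yet
    todo <- which(is.na(cont.values$value))
    res <- parLapply(cl, todo, function(i) calc_cell(cont.values$x[i], cont.values$y[i]))
    stopCluster(cl)

    # Write the results back into the global data.frame on the master
    cont.values$value[todo] <- unlist(res)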

Initializing MPI cluster with snowfall R

早过忘川 Submitted on 2019-11-30 05:16:25
Question: I've been trying to run Rmpi and snowfall on my university's clusters, but for some reason, no matter how many compute nodes I get allocated, my snowfall initialization keeps running on only one node. Here's how I'm initializing it: sfInit(parallel=TRUE, cpus=10, type="MPI") Any ideas? I'll provide clarification as needed.

Answer 1: To run an Rmpi-based program on a cluster, you need to request multiple nodes using your batch queueing system, and then execute your R script from the job script via a…
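
For illustration, a minimal snowfall/MPI sketch in which the worker count comes from the batch allocation instead of being hard-coded. SLURM_NTASKS is an assumption here; the variable name depends on the queueing system, and, as the answer notes, the script still has to be launched from the job script through an MPI launcher such as mpirun:

    library(snowfall)

    # Slots granted by the scheduler (assumed SLURM); reserve one for the master.
    # Falls back to 11 slots if the variable is not set.
    n_slots <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "11"))
    sfInit(parallel = TRUE, cpus = n_slots - 1, type = "MPI")

    # Quick sanity check: which host is each worker actually running on?
    hosts <- sfLapply(seq_len(n_slots - 1), function(i) Sys.info()[["nodename"]])
    print(table(unlist(hosts)))

    sfStop()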