snowfall

Communication of parallel processes: what are my options?

守給你的承諾、 Submitted on 2019-12-03 09:06:27
I'm trying to dig a bit deeper into parallelization of R routines. What are my options with respect to communication among a bunch of "worker" processes, regarding both the communication between the respective workers and the communication of the workers with the "master" process? AFAIU, there's no such thing as a "shared environment/shared memory" that both the master and all worker processes have access to, right? The best idea I came up with so far is to base the communication on reading and writing JSON documents to the hard drive. That's probably a bad idea ;-) I chose .json over…
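
For context, the usual pattern with snow/parallel clusters is message passing rather than shared memory: the master ships objects to the workers (e.g. with clusterExport) and gets results back as the return value of parLapply, so no disk-based JSON exchange is needed for simple cases. A minimal sketch using the parallel package; the object names are made up for illustration:

    library(parallel)

    cl <- makeCluster(4)

    # Master -> workers: copy an object from the master's environment
    shared_input <- data.frame(id = 1:10, x = rnorm(10))
    clusterExport(cl, "shared_input")

    # Workers -> master: results come back as the return value of parLapply;
    # workers cannot write into the master's global environment directly
    results <- parLapply(cl, 1:10, function(i) shared_input$x[i] * 2)

    stopCluster(cl)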

Fast correlation in R using C and parallelization

泄露秘密 Submitted on 2019-12-03 03:40:19
My project for today was to write a fast correlation routine in R using the basic skill set I have. I have to find the correlation between almost 400 variables, each having almost a million observations (i.e. a matrix of size p = 1MM rows and n = 400 cols). R's native correlation function takes almost 2 minutes for 1MM rows and 200 observations per variable. I have not run it for 400 observations per column, but my guess is it will take almost 8 minutes. I have less than 30 seconds to finish it. Hence, I want to do two things: 1 - write a simple correlation function in C and apply it in blocks in parallel (see below…
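
For reference, one common way to speed up a full correlation matrix without leaving R is to compute it as a crossproduct of the column-standardized matrix, which hands the heavy lifting to BLAS (and to however many threads BLAS uses). A minimal sketch, not the C-based approach the question pursues, with the dimensions shrunk for illustration:

    set.seed(1)
    n_rows <- 10000   # the question uses ~1MM rows
    n_cols <- 400
    X <- matrix(rnorm(n_rows * n_cols), nrow = n_rows)

    # cor(X) equals crossprod(scale(X)) / (n_rows - 1): one BLAS call
    # instead of R's element-wise bookkeeping in cor()
    Xs <- scale(X, center = TRUE, scale = TRUE)
    cor_fast <- crossprod(Xs) / (n_rows - 1)

    all.equal(cor_fast, cor(X), check.attributes = FALSE)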

How to manage parallel processing with animated ggplot2-plot?

ⅰ亾dé卋堺 Submitted on 2019-12-01 23:43:51
I'm trying to build an animated bar plot with ggplot2 and magick that grows on a day-by-day basis. Unfortunately, I've got tens of thousands of entries in my dataset (dates for each day over several years and different categories), which makes processing very slow. Thus, I'm using the snow package to speed up processing time. However, I ran into trouble when splitting my data and calling ggplot() in a cluster: magick requires splitting the data per date for animation, and snow requires splitting per cluster node for parallel processing. So I'm getting a list of lists, which causes problems when…
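
For illustration, one way to avoid the list-of-lists problem is to let each worker render its frames straight to PNG files and return only the file paths, so the master reads a flat vector of files back into magick. A minimal sketch with made-up data; the packages are snow, ggplot2 and magick as in the question:

    library(snow)
    library(ggplot2)
    library(magick)

    # Made-up data: one row per date and category
    dat <- expand.grid(date = seq(as.Date("2018-01-01"), by = "day", length.out = 30),
                       category = c("A", "B"))
    dat$value <- runif(nrow(dat))
    dates <- sort(unique(dat$date))

    out_dir <- tempfile("frames"); dir.create(out_dir)

    cl <- makeCluster(2, type = "SOCK")
    clusterEvalQ(cl, library(ggplot2))
    clusterExport(cl, c("dat", "dates", "out_dir"))

    # Each worker renders the frames for its share of the dates to disk
    frame_files <- unlist(parLapply(cl, seq_along(dates), function(k) {
      d <- dates[k]
      agg <- aggregate(value ~ category, data = dat[dat$date <= d, ], FUN = sum)
      p <- ggplot(agg, aes(category, value)) + geom_col() + ggtitle(format(d))
      f <- file.path(out_dir, paste0("frame_", format(d), ".png"))
      ggsave(f, p, width = 4, height = 3, dpi = 100)
      f
    }))
    stopCluster(cl)

    # Read the frames back on the master and animate them with magick
    image_write(image_animate(image_read(frame_files), fps = 5), "barplot.gif")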

How to calculate number of occurrences per minute for a large dataset

萝らか妹 Submitted on 2019-11-30 21:40:20
I have a dataset with 500k appointments lasting between 5 and 60 minutes. tdata <- structure(list(Start = structure(c(1325493000, 1325493600, 1325494200, 1325494800, 1325494800, 1325495400, 1325495400, 1325496000, 1325496000, 1325496600, 1325496600, 1325497500, 1325497500, 1325498100, 1325498100, 1325498400, 1325498700, 1325498700, 1325499000, 1325499300), class = c("POSIXct", "POSIXt"), tzone = "GMT"), End = structure(c(1325493600, 1325494200, 1325494500, 1325495400, 1325495400, 1325496000, 1325496000, 1325496600, 1325496600, 1325496900, 1325496900, 1325498100, 1325498100, 1325498400,
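
One straightforward approach is to expand every appointment into the minutes it covers and then tabulate. A minimal sketch on a tiny made-up data frame (the real question works on the truncated tdata above):

    # Tiny illustrative data: two overlapping appointments
    appts <- data.frame(
      Start = as.POSIXct(c("2012-01-02 09:10:00", "2012-01-02 09:12:00"), tz = "GMT"),
      End   = as.POSIXct(c("2012-01-02 09:14:00", "2012-01-02 09:13:00"), tz = "GMT")
    )

    # Expand each appointment into the minutes it covers (endpoints inclusive here)
    minutes <- unlist(lapply(seq_len(nrow(appts)),
                             function(i) seq(appts$Start[i], appts$End[i], by = "min")))
    minutes <- as.POSIXct(minutes, origin = "1970-01-01", tz = "GMT")

    # Count how many appointments are running in each minute
    occupancy <- as.data.frame(table(format(minutes, "%Y-%m-%d %H:%M")))
    names(occupancy) <- c("minute", "appointments")
    occupancy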

writing to global environment when running in parallel

徘徊边缘 Submitted on 2019-11-30 14:24:23
I have a data.frame of cells, values and coordinates. It resides in the global environment.

    > head(cont.values)
       cell value   x   y
    1 11117    NA -34 322
    2 11118    NA -30 322
    3 11119    NA -26 322
    4 11120    NA -22 322
    5 11121    NA -18 322
    6 11122    NA -14 322

Because my custom function takes almost a second to calculate an individual cell (and I have tens of thousands of cells to calculate), I don't want to duplicate calculations for cells that already have a value. My following solution tries to avoid that. Each cell can be calculated independently, screaming for parallel execution. What my function actually does…
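
For context, a common pattern is to have the workers return their results and let the master write them into the global data.frame afterwards, because assignments made on a worker (even with <<- or assign()) only touch the worker's own copy of the environment. A minimal sketch with a made-up stand-in for the slow per-cell function:

    library(parallel)

    # Small stand-in for the question's data and the expensive per-cell function
    cont.values <- data.frame(cell = 11117:11122, value = NA_real_,
                              x = seq(-34, -14, by = 4), y = 322)
    calc_cell <- function(x, y) { Sys.sleep(0.1); x + y }

    cl <- makeCluster(4)
    clusterExport(cl, c("cont.values", "calc_cell"))

    # Only compute the cells that do not have a value yet
    todo <- which(is.na(cont.values$value))
    res <- parLapply(cl, todo, function(i) calc_cell(cont.values$x[i], cont.values$y[i]))
    stopCluster(cl)

    # Write the results back into the global data.frame on the master
    cont.values$value[todo] <- unlist(res)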

Initializing MPI cluster with snowfall R

早过忘川 Submitted on 2019-11-30 05:16:25
Question: I've been trying to run Rmpi and snowfall on my university's clusters, but for some reason, no matter how many compute nodes I get allocated, my snowfall initialization keeps running on only one node. Here's how I'm initializing it: sfInit(parallel=TRUE, cpus=10, type="MPI") Any ideas? I'll provide clarification as needed.

Answer 1: To run an Rmpi-based program on a cluster, you need to request multiple nodes using your batch queueing system, and then execute your R script from the job script via a…
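
For illustration, a minimal snowfall/MPI sketch in which the worker count comes from the batch allocation instead of being hard-coded. SLURM_NTASKS is an assumption here; the variable name depends on the queueing system, and, as the answer notes, the script still has to be launched from the job script through an MPI launcher such as mpirun:

    library(snowfall)

    # Slots granted by the scheduler (assumed SLURM); reserve one for the master.
    # Falls back to 11 slots if the variable is not set.
    n_slots <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "11"))
    sfInit(parallel = TRUE, cpus = n_slots - 1, type = "MPI")

    # Quick sanity check: which host is each worker actually running on?
    hosts <- sfLapply(seq_len(n_slots - 1), function(i) Sys.info()[["nodename"]])
    print(table(unlist(hosts)))

    sfStop()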