Multithread computation with R: how to get all different random numbers?


Question


Does anyone know how to make all the random numbers in the following code different, e.g. with the doRNG package? I don't care about reproducibility.

Edit: Duplicates that occur by pure chance are acceptable.

rm(list = ls())
set.seed(666)
cat("\014")
library(plyr)
library(dplyr)
library(doRNG)

# ====== Data Preparation ======
dt = data.frame(id = 1:10,
                part = rep("dt",10),
                HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),
                random = NA)

# ====== Set Parallel Computing ======
library(foreach)
library(doParallel)

cl = makeCluster(3, outfile = "")
registerDoParallel(cl)

# ====== SIMULATION ======
nsim = 1000                # number of simulations
iterChunk = 100            # split nsim into this many chunks
out = data.frame()    # prepare output DF
for(iter in 1:ceiling(nsim/iterChunk)){
  strt = Sys.time()
  
  out_iter = 
    foreach(i = 1:iterChunk, .combine = rbind, .multicombine = TRUE, .maxcombine = 100000, .inorder = FALSE, .verbose = FALSE,
            .packages = c("plyr", "dplyr")) %dopar% {
              
              # simulation number
              id_sim = iterChunk * (iter - 1) + i

              ## Generate random numbers
              tmp_sim = is.na(dt$HG) # no results yet
              dt$random[tmp_sim] = runif(sum(tmp_sim))
              dt$HG[tmp_sim] = 3

              # Save Results
              dt$id_sim = id_sim
              dt$iter = iter
              dt$i = i
              
              print(Sys.time())
              return(dt)
            }#i;sim_forcycle
  
  out = rbind.data.frame(out,subset(out_iter, !is.na(random)))
  
  fnsh = Sys.time()
  cat(" [",iter,"] ",fnsh - strt, sep = "")
}#iter

# ====== Stop Parallel Computing ======
stopCluster(cl)

# ====== Distinct Random Numbers ======
length(unique(out$random))              # expectation: 6000

I have been struggling with this for two days. I asked this question earlier and only got general responses about random numbers.

Here I would like to ask how to set the doRNG package options (or those of a similar package) so that all the random numbers are different, across all the loops.

I have tried tons of doRNG settings and I still can't get it to work. I have tried R versions 3.5.3 and 3.6.3 on two different computers.
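For context, the kind of doRNG setting I tried looks roughly like the sketch below; registerDoRNG() and the %dorng% operator are the package's standard entry points, and none of the variations I tried changed the outcome.

library(doRNG)
library(doParallel)

cl = makeCluster(3)
registerDoParallel(cl)

# Option A: wrap the registered backend so every %dopar% loop
# gets independent L'Ecuyer-CMRG streams
registerDoRNG(666)

# Option B: use the %dorng% operator on the loop itself, e.g.
# out_iter = foreach(i = 1:iterChunk, .combine = rbind) %dorng% { ... }

stopCluster(cl)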


UPDATE Following discussion with @Limey

The purpose of the code is to simulate football matches. As the simulation is large, I use iterChunk to split it into manageable parts and, after each iter, send the data to a PostgreSQL database so the simulation doesn't overload RAM. Some matches already have real-world results and have HG (home goals) filled in; I want to simulate the rest.

When setting iterChunk to 1, everything is fine. Increasing iterChunk leads to the same numbers being generated within an iter. For example, when I set nsim to 100 and iterChunk to 10 (all matches simulated 100 times, 10 at a time in 10 loops), I expect 600 random numbers (each match independently simulated across all the loops). However, I only get 180, following the logic: 3 cores * 6 matches * 10 iterChunks. Using 2 workers I get 120 distinct random numbers (2 * 6 * 10).
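A quick sanity check of that arithmetic, as a sketch run after the main loop (using the objects defined in the code above):

length(unique(out$random))            # observed: 180 with 3 workers, 120 with 2
workers = 3
matches = sum(is.na(dt$HG))           # 6 matches without a real result
chunks  = ceiling(nsim / iterChunk)   # 10 chunks when nsim = 100 and iterChunk = 10
workers * matches * chunks            # 3 * 6 * 10 = 180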

Furthermore: if I exclude dt$HG[tmp_sim] = 3, I get all random numbers different regardless of the settings.

To understand the problem, I suggest:

  1. Run the code as is (possibly setting nsim to 100 and iterChunk to 10). You will get 180 different random numbers. With lower values of nsim and iterChunk, things may work as expected.
  2. Comment out dt$HG[tmp_sim] = 3. You will get 6000 different random numbers (600 if you change nsim and iterChunk).

The code in the 2nd step assigns the goals scored by the home team. It looks like some kind of bug I can't get past. Even hearing that someone else gets the same result and doesn't know why would be helpful - it would lift the weight of my own stupidity off me.

Thank you, I highly appreciate any effort.


Answer 1:


I realised what the problem with OP's code was whilst I was in the shower. It's simple, and obvious in retrospect: all the loops and parallel processes are working on the same object - the dt data frame. So they're constantly overwriting the changes that each makes, and at the end of the outer loop, you just have multiple copies of the changes made by the last loop to complete. The solution is equally simple: work on a copy of the dt data frame.

To minimise the changes, I renamed dt to baseDT:

# ====== Data Preparation ======
baseDT = data.frame(id = 1:10,
                    part = rep("dt",10),
                    HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),
                    random = NA)

and then took a copy of it at the top of the foreach loop:

  out_iter = foreach(i = 1:iterChunk, 
               .combine = rbind, .multicombine = TRUE, .maxcombine = 100000, 
               .inorder = FALSE, .verbose = FALSE,
               .packages = c("plyr", "dplyr")) %dopar% {
    dt <- baseDT

This gives

> length(unique(out$random))              # expectation: 6000
[1] 6000

as expected.
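For completeness, here is a condensed sketch of the corrected worker body (assuming the rest of the original loop is unchanged; only the first line inside %dopar% is new):

out_iter =
  foreach(i = 1:iterChunk, .combine = rbind, .multicombine = TRUE, .maxcombine = 100000,
          .inorder = FALSE, .packages = c("plyr", "dplyr")) %dopar% {

            dt <- baseDT                          # work on a private copy, not the shared object

            id_sim = iterChunk * (iter - 1) + i   # simulation number
            tmp_sim = is.na(dt$HG)                # matches with no real result yet
            dt$random[tmp_sim] = runif(sum(tmp_sim))
            dt$HG[tmp_sim] = 3

            dt$id_sim = id_sim
            dt$iter = iter
            dt$i = i
            dt
          }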




Answer 2:


Modifying the "Hello World" example in the "getting started with doParallel" vignette to generate random numbers, I came up with:

library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)
myFunc <- function(n) {runif(n)}
foreach(i=1:3) %dopar% myFunc(10)
[[1]]
 [1] 0.18492375 0.13388278 0.65455450 0.93093066 0.41157625 0.89479764 0.14736529 0.47935995 0.03062963 0.16110714

[[2]]
 [1] 0.89245145 0.20980791 0.83828019 0.04411547 0.38184303 0.48110619 0.51509058 0.93732055 0.40159834 0.81414140

[[3]]
 [1] 0.74393129 0.66999730 0.44411989 0.85040773 0.80224527 0.72483644 0.64566262 0.22546420 0.14526819 0.05931329

This suggests that getting different random numbers across threads is straightforward. Indeed, the examples on pages 2 and 3 of the doRNG reference manual show the same thing.

In fact, if I understand you correctly, the purpose of doRNG is to do precisely the opposite of what you want: to make random processes reproducible across threads.
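A minimal sketch of that reproducibility, along the lines of the doRNG vignette (the %dorng% operator gives each iteration its own independent stream, seeded by the call to set.seed):

library(doParallel)
library(doRNG)

cl <- makeCluster(2)
registerDoParallel(cl)

set.seed(123)
res1 <- foreach(i = 1:3) %dorng% runif(5)
set.seed(123)
res2 <- foreach(i = 1:3) %dorng% runif(5)
identical(res1, res2)   # TRUE: the same seed reproduces the same streams across workers

stopCluster(cl)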

Of course, this doesn't guarantee that all numbers are different across all threads. But it makes duplication very unlikely. A guarantee of no duplicates would mean some degree of determinism in the process: a completely random process might produce duplicates by chance.
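To put a rough number on "very unlikely", here is a back-of-the-envelope birthday approximation, under the assumption that runif() values behave like draws from about 2^32 equally likely values (roughly the resolution of R's default Mersenne-Twister output):

n <- 6000      # total random numbers drawn
N <- 2^32      # assumed number of distinct values runif() can return
1 - exp(-n * (n - 1) / (2 * N))   # approximate probability of at least one duplicate
# roughly 0.004, i.e. well under 1%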

Update Following on from our conversation in the comments...

We've established that the problem is in your program logic, not the parallelisation per se. So we need to refocus the question: what are you trying to do? I'm afraid it's not at all clear to me, which means we need to simplify.

I set nsim to 5 and iterChunk to 1. I get 5 data frames which look like

   id part HG    random id_sim iter i
1   1   dt  1        NA      1    1 1
2   2   dt  3        NA      1    1 1
3   3   dt  6        NA      1    1 1
4   4   dt  3 0.6919744      1    1 1
5   5   dt  3 0.5413398      1    1 1
6   6   dt  2        NA      1    1 1
7   7   dt  3 0.3983175      1    1 1
8   8   dt  3 0.3342174      1    1 1
9   9   dt  3 0.6126020      1    1 1
10 10   dt  3 0.4185468      1    1 1

In each, the values of id_sim and iter are always the same, and run from 1 in the first data frame to 5 in the fifth. i is 1 for all rows in all data frames. Values in random do appear to be random, and different between data frames. But the NAs are all in the same positions in every data frame: the 1st, 2nd, 3rd and 6th rows. The values of HG are as shown above for all five data frames.

Is that what you would expect? If not, what do you expect? Given we know the problem is not the parallelisation, you need to give us more information.

Update 2

Do you know Arduan? They posted a related question over the weekend...

I'm not going to tell you what's wrong with your code. I'll show you how I would approach your problem. I hope you'll agree it's more readable, if nothing else.

So, we're simulating some football matches. I'll assume it's a league format and use the English Premier League as an example. Start by generating the fixture list for a single season.

library(tidyverse)

teams <- c("Arsenal", "Aston Villa", "Bournemouth", "Brighton & Hove Albion", 
           "Burnley", "Chelsea", "Crystal Palace", "Everton", "Leicester City", 
           "Liverpool", "Manchester City", "Manchester United", "Newcastle United", 
           "Norwich City", "Sheffield United", "Southampton", "Tottenham Hotspur", 
           "Watford", "West Ham United", "Wolverhampton Wanderers")

fixtures <- tibble(HomeTeam=teams, AwayTeam=teams) %>% 
              complete(HomeTeam, AwayTeam) %>% 
              filter(HomeTeam != AwayTeam) # A team can't play itself
fixtures %>% head(5)
# A tibble: 5 x 2
  HomeTeam AwayTeam              
  <chr>    <chr>                 
1 Arsenal  Aston Villa           
2 Arsenal  Bournemouth           
3 Arsenal  Brighton & Hove Albion
4 Arsenal  Burnley               
5 Arsenal  Chelsea     

Suppose we know some results. I'll use yesterday's matches as an illustration.

knownResults <- tribble(~HomeTeam,          ~AwayTeam,         ~HomeGoals, ~AwayGoals,
                        "Burnley",          "Sheffield United",         1,          1,
                        "Newcastle United", "West Ham United",          2,          2,
                        "Liverpool",        "Aston Villa",              2,          0,
                        "Southampton",      "Manchester City",          1,          0)

resultsSoFar <- fixtures %>% 
             left_join(knownResults, by=c("HomeTeam", "AwayTeam"))
resultsSoFar %>% filter(!is.na(HomeGoals))
# A tibble: 4 x 4
  HomeTeam         AwayTeam         HomeGoals AwayGoals
  <chr>            <chr>                <dbl>     <dbl>
1 Burnley          Sheffield United         1         1
2 Liverpool        Aston Villa              2         0
3 Newcastle United West Ham United          2         2
4 Southampton      Manchester City          1         0

Now some utility functions. You could certainly combine them, but I think it's clearer to keep them separate so you can see exactly what each one is doing.

First, a function to simulate the results of all matches whose results are unknown. The details of how you simulate the scores are entirely arbitrary. I've assumed that home teams score an average of 1.5 goals a game, away teams score 1.2 goals per game. Later on, I'm going to use this to simulate many seasons in one go, so I'll add a variable (Iteration) to index the season.

simulateResults <- function(i=NA, data) {
  n <- nrow(data)
  data %>%
    add_column(Iteration=i, .before=1) %>% 
    mutate(
      # Give the home team a slight advantage
      HomeGoals=ifelse(is.na(HomeGoals), floor(rexp(n, rate=1/1.5)), HomeGoals),
      AwayGoals=ifelse(is.na(AwayGoals), floor(rexp(n, rate=1/1.2)), AwayGoals)
    )
}

Use it, and check that we haven't overwritten known results:

simulateResults(1, resultsSoFar) %>% filter(HomeTeam=="Burnley", AwayTeam=="Sheffield United")
# A tibble: 1 x 5
  Iteration HomeTeam AwayTeam         HomeGoals AwayGoals
      <dbl> <chr>    <chr>                <dbl>     <dbl>
1         1 Burnley  Sheffield United         1         1

I'm going to parallelise the overall simulation, so now let's have a function to simulate a chunk of simulations. Again, create an index column to identify the chunk.

simulateChunk <- function(chunkID=NA, n) {
  bind_rows(lapply(1:n, simulateResults, data=resultsSoFar)) %>% 
    add_column(Chunk=chunkID, .before=1)
}
simulateChunk(chunkID=1, n=3)
# A tibble: 1,140 x 6
   Chunk Iteration HomeTeam AwayTeam               HomeGoals AwayGoals
   <dbl>     <int> <chr>    <chr>                      <dbl>     <dbl>
 1     1         1 Arsenal  Aston Villa                    2         0
 2     1         1 Arsenal  Bournemouth                    0         0
 3     1         1 Arsenal  Brighton & Hove Albion         2         0
 4     1         1 Arsenal  Burnley                        2         0
 5     1         1 Arsenal  Chelsea                        1         0
 6     1         1 Arsenal  Crystal Palace                 0         0
 7     1         1 Arsenal  Everton                        2         3
 8     1         1 Arsenal  Leicester City                 2         0
 9     1         1 Arsenal  Liverpool                      0         1
10     1         1 Arsenal  Manchester City                4         0

OK. Now I'm ready to do the main simulation work. I'll run 10 chunks of 100 simulations each, to give 1000 simulated seasons in total, the same as you had.

library(doParallel)

cl <- makeCluster(3)
registerDoParallel(cl)

chunkSize <- 100
nChunks <- 10

startedAt <- Sys.time()
x <- bind_rows(foreach(i=1:nChunks, .packages=c("tidyverse")) %dopar% simulateChunk(i, n=chunkSize))
finishedAt <- Sys.time()
print(finishedAt - startedAt)
Time difference of 6.772928 secs
stopCluster(cl)
> x
# A tibble: 380,000 x 6
   Chunk Iteration HomeTeam AwayTeam               HomeGoals AwayGoals
   <int>     <int> <chr>    <chr>                      <dbl>     <dbl>
 1     1         1 Arsenal  Aston Villa                    2         0
 2     1         1 Arsenal  Bournemouth                    3         1
 3     1         1 Arsenal  Brighton & Hove Albion         0         1
 4     1         1 Arsenal  Burnley                        3         0
 5     1         1 Arsenal  Chelsea                        1         0
 6     1         1 Arsenal  Crystal Palace                 0         0
 7     1         1 Arsenal  Everton                        1         2
 8     1         1 Arsenal  Leicester City                 0         0
 9     1         1 Arsenal  Liverpool                      0         0
10     1         1 Arsenal  Manchester City                0         0

Let's check I've got sensible results. As a basic check, I'll look at the results of Arsenal vs Aston Villa:

x %>% 
  filter(HomeTeam == "Arsenal", AwayTeam=="Aston Villa") %>% 
  group_by(HomeGoals, AwayGoals) %>% 
  summarise(N=n(), .groups="drop") %>% 
  pivot_wider(
    values_from="N", names_prefix="AwayGoals", 
    names_sep="", names_from=AwayGoals
  )
# A tibble: 8 x 10
  HomeGoals AwayGoals0 AwayGoals1 AwayGoals2 AwayGoals3 AwayGoals4 AwayGoals5 AwayGoals6 AwayGoals8 AwayGoals7
      <dbl>      <int>      <int>      <int>      <int>      <int>      <int>      <int>      <int>      <int>
1         0        299        129         57         19         12          7         NA         NA         NA
2         1        135         63         25          6          4          4          1          2         NA
3         2         75         21         12          9          4          1         NA         NA          1
4         3         30         13         10          1         NA         NA         NA         NA         NA
5         4         21          7          1          1         NA         NA         NA         NA         NA
6         5         11          2          1         NA          2         NA         NA         NA         NA
7         6          4          2          2         NA         NA         NA         NA         NA         NA
8         7          4          1          1         NA         NA         NA         NA         NA         NA

That looks reasonable. Now confirm that the matches with known results don't vary. For example:

x %>% 
  filter(HomeTeam == "Liverpool", AwayTeam=="Aston Villa") %>% 
  group_by(HomeGoals, AwayGoals) %>% 
  summarise(N=n(), .groups="drop") %>% 
  pivot_wider(values_from="N", names_prefix="AwayGoals", names_sep="", names_from=AwayGoals)
  HomeGoals AwayGoals0
      <dbl>      <int>
1         2       1000

All good.

So, that's 23 statements to generate the fixtures, take account of known results, simulate the remainder of the matches and do some basic sanity checking. I could easily get that down to under 20 statements if I had to. That's about a third less than you were using just to try to simulate the unknown results. [The actual simulation takes fewer than 10 statements.] I think my approach is easier to understand: by using tidy verbs, the code is almost self-documenting.



Source: https://stackoverflow.com/questions/62758637/multithread-computation-with-r-how-to-get-all-different-random-numbers
