How do I optimize the parallelization of Monte Carlo data generation with MPI?

Submitted by 混江龙づ霸主 on 2020-08-10 20:16:36

Question


I am currently building a Monte Carlo application in C++ and I have a question regarding parallelization with MPI.

The process I want to parallelize is the MC generation of data. To achieve good precision in my final results, I specify a target number of data points. Each data point is generated independently, but individual points might require vastly differing amounts of time.

How do I organize the parallelization and workload distribution of the data generation most efficiently?

What I have done so far

So far I have come up with three possible ways of organizing the MPI part of the code:

  1. The simplest, but most likely inefficient, way: I divide the desired sample size by the number of workers and let every worker generate that amount of data in isolation. However, by the time the slowest worker finishes, all other workers may have been idling for a long time. They could have been "supporting" the slowest worker by sharing its workload.

  2. Use a master: A master communicates with the workers who work continuously until the master process registers that we have enough data and tells everybody to stop what they are doing. The disadvantage I see is that the master process might not be necessary and could be generating data instead (especially when I don't have a lot of workers).

  3. A "ring communication" algorithm I came up with myself: A message is continuously sent and updated in a circle (1->2, 2->3, ... , N->1). This message contains the global number of generated data points. Once the desired goal is met, the message is tagged, circles one more time, and thereby tells everybody to stop working. Importantly, I use non-blocking communication (with MPI_Iprobe before receiving via MPI_Recv, and sending via MPI_Isend). This way, everybody works, and no one ever idles.

No matter which solution is chosen, in the end I reduce all data sets into one big set and continue processing the data.

The concrete questions:

  • Is there an "optimal" way of parallelizing such a fairly simple process? Would you prefer any of the proposed solutions for some reason?
  • What do you think of this "ring communication" solution?
  • I'm sure I'm not the first one to come up with something like the ring communication algorithm. I have tried to google this problem, but it seems I don't know the right terminology in this context. There must be plenty of material and literature on such simple algorithms, but I never had a formal course on MPI/parallelization. What are the "keywords" to look for?

Any advice and tips are much appreciated.

Source: https://stackoverflow.com/questions/62636497/how-do-i-optimize-the-parallelization-of-monte-carlo-data-generation-with-mpi
