I have a simple program that I am using for physics simulation. I want to know how to implement a certain threading paradigm in OpenMP.
int main() {
    #define steps (100000)
    for (int t = 0; t < steps; t++) {
        firstParallelLoop();
        secondParallelLoop();
        if (!(t % 100)) {
            checkpoint();
        }
    }
}

void firstParallelLoop() { // In another file.c
    #pragma omp parallel for
    for (int i = 0; i < sizeOfSim; i++) {
        // Some atomic floating point ops.
    }
}
Formerly, I was using pthreads and got a 1.7x speedup on my dual-core laptop. I can't seem to get any speedup when using OpenMP. I suspect the problem is that the thread groups/pools are rapidly being created and destroyed, with disastrous effect.
In my pthreads implementation I needed to ensure that no new threads were created and that my program behaved as a client-server. In the pthreads scheme, main() was the server, and calls to firstParallelLoop released mutexes/semaphores that triggered the worker threads to reprocess the data.
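For context, this is a rough sketch of the kind of persistent-worker setup I mean (not my actual code; worker_main, do_chunk, and the flags are illustrative names, shown for a single worker):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static bool work_ready = false, work_done = false, quit = false;

static void do_chunk(void) { /* per-iteration simulation work */ }

static void *worker_main(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!work_ready && !quit)      /* sleep until main() signals work */
            pthread_cond_wait(&cv, &lock);
        if (quit) { pthread_mutex_unlock(&lock); break; }
        work_ready = false;
        pthread_mutex_unlock(&lock);

        do_chunk();                       /* same thread is reused every iteration */

        pthread_mutex_lock(&lock);
        work_done = true;
        pthread_cond_broadcast(&cv);      /* tell main() this chunk is finished */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}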
When I look at CPU utilization I expect it to be over the 30% mark (4 logical cores: 2 physical cores with HT), but it stays around 27%...
How do I get OpenMP to do something similar? How can I tell OpenMP to reuse my threads?
The GCC OpenMP run-time libgomp implements thread teams on POSIX systems with something akin to a thread pool: threads are only created when the first parallel region is encountered, and each thread then runs an infinite work loop. Entering and exiting a parallel region is implemented with barriers. By default libgomp uses a combination of busy-waiting and sleeping to implement barriers. The amount of busy-waiting is controlled by the OMP_WAIT_POLICY environment variable. If it is not specified, threads waiting at a barrier busy-wait for 300000 spins (about 3 ms at 100000 spins/msec) and then go to sleep. If OMP_WAIT_POLICY is set to active, the busy-wait time is increased to 30000000000 spins (about 5 minutes at 100000 spins/msec). You can fine-tune the busy-waiting time by setting the GOMP_SPINCOUNT environment variable to the number of busy cycles (libgomp assumes about 100000 spins/msec, but this could vary by a factor of 5 depending on the CPU). You can fully disable sleeping like this:
OMP_WAIT_POLICY=active GOMP_SPINCOUNT=infinite OMP_NUM_THREADS=... ./program
This will somewhat improve the thread-team start-up time, but at the expense of CPU time, since idle threads will busy-wait instead of sleeping.
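To see how much the region start-up actually costs on your machine, a minimal timing sketch like the following (my addition, not from your program; compile with gcc -fopenmp) measures the average cost of entering and leaving an empty parallel region. Run it once with the default settings and once with the environment variables above to compare:

#include <omp.h>
#include <stdio.h>

int main(void) {
    const int reps = 10000;
    double t0 = omp_get_wtime();
    for (int t = 0; t < reps; t++) {
        #pragma omp parallel
        {
            /* empty body: only the fork/join and barrier cost is measured */
        }
    }
    double t1 = omp_get_wtime();
    printf("average parallel-region overhead: %.2f us\n",
           (t1 - t0) / reps * 1e6);
    return 0;
}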
In order to remove the overhead you should rewrite your program in a more OpenMP-friendly way. Your example code could be rewritten like this:
int main() {
    #define steps (100000)
    #pragma omp parallel
    {
        for (int t = 0; t < steps; t++) {
            firstParallelLoop();
            secondParallelLoop();
            if (!(t % 100)) {
                #pragma omp master
                checkpoint();
                #pragma omp barrier
            }
        }
    }
}

void firstParallelLoop() { // In another file.c
    #pragma omp for
    for (int i = 0; i < sizeOfSim; i++) {
        // Some atomic floating point ops.
    }
}
Note the following two things:

- A parallel region is inserted in the main program. It is not a parallel for, though. All threads in the team execute the outer loop steps times.
- The for loop in firstParallelLoop is made parallel with omp for only. Thus it executes as a serial loop when called outside a parallel region and as a worksharing loop when called from inside one (a so-called orphaned worksharing construct; see the sketch after this list). The same should be done for the loop in secondParallelLoop.
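The orphaned-worksharing behaviour from the second point can be seen in isolation with a small standalone sketch (my addition; sizeOfSim and the printouts are placeholders for your simulation code; compile with gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

#define sizeOfSim 8

void firstParallelLoop(void) {
    #pragma omp for
    for (int i = 0; i < sizeOfSim; i++)
        printf("i=%d handled by thread %d\n", i, omp_get_thread_num());
}

int main(void) {
    firstParallelLoop();     /* no enclosing parallel region: runs serially on thread 0 */

    #pragma omp parallel     /* enclosing region: iterations are shared among the team */
    firstParallelLoop();
    return 0;
}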
The barrier in the main loop ensures that the other threads wait for the checkpoint to finish before starting the next iteration.
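As a possible variation (my suggestion, not part of the rewrite above): if checkpoint() does not have to run on the master thread specifically, #pragma omp single can replace the master/barrier pair, because a single construct ends with an implicit barrier unless nowait is specified:

if (!(t % 100)) {
    #pragma omp single   /* one thread runs checkpoint(); implicit barrier at the end */
    checkpoint();
}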