Does the following code just parallelize the first (outer) loops, or it parallelize the entire nested loops?
#pragma omp parallel for
for (int i=0;i&
You will be able to better understand this with the following example. Let's do this with two threads.
#pragma omp parallel for num_threads(2)
for(int i=0; i< 3; i++) {
for (int j=0; j< 3; j++) {
printf("i = %d, j= %d, threadId = %d \n", i, j, omp_get_thread_num());
}
}
then the result will be,
i = 0, j= 0, threadId = 0
i = 0, j= 1, threadId = 0
i = 0, j= 2, threadId = 0
i = 1, j= 0, threadId = 0
i = 1, j= 1, threadId = 0
i = 1, j= 2, threadId = 0
i = 2, j= 0, threadId = 1
i = 2, j= 1, threadId = 1
i = 2, j= 2, threadId = 1
That means, when you add #pragma omp parallel for to the uppermost for loop, the index of that for loop is divided among the threads. As you can see, when index of i is same the thread id is also the same.
Instead of that, we can parallel the combinations that we have in a nested for loop. In this example we can have following combinations of i and j.
i = 0, j= 0
i = 0, j= 1
i = 0, j= 2
i = 1, j= 0
i = 1, j= 1
i = 1, j= 2
i = 2, j= 0
i = 2, j= 1
i = 2, j= 2
In order to parallelize the code combination wise, we can add the collapse keyword as follows.
#pragma omp parallel for num_threads(2) collapse(2)
for(int i=0; i< 3; i++) {
for (int j=0; j< 3; j++) {
printf("i = %d, j= %d, threadId = %d \n", i, j, omp_get_thread_num());
}
}
then the result will be as follows.
i = 0, j= 0, threadId = 0
i = 0, j= 1, threadId = 0
i = 1, j= 2, threadId = 1
i = 2, j= 0, threadId = 1
i = 2, j= 1, threadId = 1
i = 2, j= 2, threadId = 1
i = 0, j= 2, threadId = 0
i = 1, j= 0, threadId = 0
i = 1, j= 1, threadId = 0
Then you can see that unlike before, for the same index i, there can be different thread ids ( when (i=1 and j=2 threadId=1) also (i=1 and j=0 threadId=0)). That means in this scenario, the combinations of i and j are divided among the threads.
The lines you have written will parallelize only the outer loop. To parallelize both you need to add a collapse
clause:
#pragma omp parallel for collapse(2)
for (int i=0;i<N;i++)
{
for (int j=0;j<M;j++)
{
//do task(i,j)//
}
}
You may want to check OpenMP 3.1 specifications (sec 2.5.1) for more details.
OpenMP only parallelizes the loop next to the pragma. You can parallelize the inner loop too if you want to but it won't be done automatically.