Question
I'm trying to do a parallel for inside a while, something like this:
while(!End){
    for(...;...;...) // the parallel for
        ...
    // serial code
}
The for loop is the only parallel section of the while loop. If I do this, I have a lot of overhead:
cycles = 0;
while(!End){ // approx. 1e9 iterations
    #pragma omp parallel for
    for(i=0;i<N;i++) // the parallel for, approx. 256 iterations
        if(time[i] == cycles){
            if (wbusy[i]){
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
        }
    // serial code
    ++cycles;
}
The iterations of the for loop are independent of each other.
There are dependencies between the serial code and the parallel code.
Answer 1:
Normally you don't have to worry much about putting parallel regions inside loops: modern OpenMP implementations are efficient about reusing thread teams, and as long as there is plenty of work in the loop you're fine. But here, with an outer loop count of ~1e9, an inner loop count of ~256, and very little work per iteration, the overhead is likely comparable to or larger than the work itself, and performance will suffer.
So there will be a noticeable difference between this:
cycles = 0;
while(!End){ // approx. 1e9 iterations
    #pragma omp parallel for
    for(i=0;i<N;i++) // the parallel for, approx. 256 iterations
        if(time[i] == cycles){
            if (wbusy[i]){
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
        }
    // serial code
    ++cycles;
}
and this:
cycles = 0;
#pragma omp parallel
while(!End){ // approx. 1e9 iterations
    #pragma omp for
    for(i=0;i<N;i++) // the parallel for, approx. 256 iterations
        if(time[i] == cycles){
            if (wbusy[i]){
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
        }
    // the serial code goes inside the single block so only one thread runs it
    #pragma omp single
    {
        // serial code
        ++cycles;
    }
}
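For reference, here is a minimal self-contained sketch of that second pattern. The function name run_cycles, the max_cycles parameter, and the end flag are hypothetical stand-ins for the real termination logic, which the question does not show. It relies on the implicit barriers at the end of the omp for and omp single constructs, so every thread reads the same value of end before the next update can happen.

```c
#include <stddef.h>

/* Hypothetical helper: advance the cycle counter up to max_cycles,
 * marking task i as finished when time[i] equals the current cycle.
 * Returns the number of cycles executed. */
long run_cycles(const long *time, int *wbusy, int *wfinished,
                int n, long max_cycles)
{
    long cycles = 0;
    int end = (max_cycles <= 0);   /* assumed stand-in for the real End flag */

    /* The thread team is created once, outside the while loop. */
    #pragma omp parallel shared(cycles, end)
    {
        while (!end) {
            /* Work-sharing loop: iterations are split across the team. */
            #pragma omp for
            for (int i = 0; i < n; i++) {
                if (time[i] == cycles && wbusy[i]) {
                    wbusy[i] = 0;
                    wfinished[i] = 1;
                }
            }
            /* Implicit barrier here: all threads have finished the loop. */

            /* Serial part: executed by exactly one thread. */
            #pragma omp single
            {
                ++cycles;
                end = (cycles >= max_cycles);
            }
            /* Implicit barrier here: all threads see the new cycles/end. */
        }
    }
    return cycles;
}
```

With GCC or Clang this needs -fopenmp to run in parallel; without that flag the pragmas are ignored and the function runs serially with the same result.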
But really, that scan across the time array every iteration is both (a) slow and (b) not enough work to keep multiple cores busy, because it is memory-intensive. With more than a couple of threads you will actually get worse performance than serial, even without the OpenMP overheads, just because of memory contention. Admittedly what you have posted here is just an example, not your real code, but why not preprocess the time array so that you only have to check when the next task is due to be updated:
#include <stdio.h>
#include <stdlib.h>

struct tasktime_t {
    long int time;
    int task;
};

/* qsort comparator: order by time, ascending. Compare instead of
   subtracting, so a long difference is never truncated to int. */
int stime_compare(const void *a, const void *b) {
    long int ta = ((const struct tasktime_t *)a)->time;
    long int tb = ((const struct tasktime_t *)b)->time;
    return (ta > tb) - (ta < tb);
}

int main(int argc, char **argv) {
    const int n=256;
    const long int niters = 100000000L;
    long int time[n];
    int wbusy[n];
    int wfinished[n];

    for (int i=0; i<n; i++) {
        time[i] = rand() % niters;
        wbusy[i] = 1;
        wfinished[i] = 0;
    }

    /* Build (time, task) pairs and sort them once by time. */
    struct tasktime_t stimes[n];
    for (int i=0; i<n; i++) {
        stimes[i].time = time[i];
        stimes[i].task = i;
    }
    qsort(stimes, n, sizeof(struct tasktime_t), stime_compare);

    long int cycles = 0;
    int next = 0;   /* index of the next task due in sorted order */
    while(cycles < niters){ // 1e8 iterations here
        /* Only tasks due in this cycle are touched; no full scan. */
        while ( (next < n) && (stimes[next].time == cycles) ) {
            int i = stimes[next].task;
            if (wbusy[i]){
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
            next++;
        }
        ++cycles;
    }
    return 0;
}
This is ~5 times faster than the serial version of the scanning approach (and much faster than the OpenMP versions). Even if you are constantly updating the time/wbusy/wfinished arrays in the serial code, you can keep track of their completion times using a priority queue, with each update taking O(log N) time instead of a scan taking O(N) time every iteration.
Source: https://stackoverflow.com/questions/26345002/parallel-for-inside-a-while-using-openmp-on-c
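The priority-queue idea could look roughly like the sketch below: a fixed-capacity binary min-heap keyed on completion time. All names here (event, event_heap, heap_push, heap_pop, complete_due, HEAP_CAP) are hypothetical, and the sketch assumes the serial code only schedules events for the current or a later cycle, so the earliest pending event is never in the past.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_CAP 256   /* assumed upper bound on pending events */

struct event { long time; int task; };

struct event_heap {
    struct event items[HEAP_CAP];
    size_t size;
};

static void swap_events(struct event *a, struct event *b) {
    struct event tmp = *a; *a = *b; *b = tmp;
}

/* Insert an event and sift it up toward the root: O(log N). */
void heap_push(struct event_heap *h, long time, int task) {
    assert(h->size < HEAP_CAP);
    size_t i = h->size++;
    h->items[i].time = time;
    h->items[i].task = task;
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (h->items[parent].time <= h->items[i].time) break;
        swap_events(&h->items[parent], &h->items[i]);
        i = parent;
    }
}

/* Remove and return the earliest event, then sift down: O(log N). */
struct event heap_pop(struct event_heap *h) {
    assert(h->size > 0);
    struct event top = h->items[0];
    h->items[0] = h->items[--h->size];
    size_t i = 0;
    for (;;) {
        size_t left = 2 * i + 1, right = left + 1, smallest = i;
        if (left < h->size && h->items[left].time < h->items[smallest].time)
            smallest = left;
        if (right < h->size && h->items[right].time < h->items[smallest].time)
            smallest = right;
        if (smallest == i) break;
        swap_events(&h->items[i], &h->items[smallest]);
        i = smallest;
    }
    return top;
}

/* Pop every event due exactly at `cycles` and complete its task if it
 * is still busy. Returns the number of tasks completed. */
int complete_due(struct event_heap *h, long cycles,
                 int *wbusy, int *wfinished) {
    int done = 0;
    while (h->size > 0 && h->items[0].time == cycles) {
        struct event e = heap_pop(h);
        if (wbusy[e.task]) {
            wbusy[e.task] = 0;
            wfinished[e.task] = 1;
            done++;
        }
    }
    return done;
}
```

In the main loop, the serial code would call heap_push whenever it assigns a task a new completion time, and complete_due(&h, cycles, wbusy, wfinished) would replace the per-cycle scan.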