Question
I have the following algorithm:
int hostMatch(long *comparisons)
{
    int i = -1;
    int lastI = textLength - patternLength;
    *comparisons = 0;
    #pragma omp parallel for schedule(static, 1) num_threads(1)
    for (int k = 0; k <= lastI; k++)
    {
        int j;
        for (j = 0; j < patternLength; j++)
        {
            (*comparisons)++;
            if (textData[k+j] != patternData[j])
            {
                j = patternLength + 1; // break
            }
        }
        if (j == patternLength && k > i)
            i = k;
    }
    return i;
}
When changing num_threads, I get the following results for the number of comparisons:
- 01 = 9949051000
- 02 = 4992868032
- 04 = 2504446034
- 08 = 1268943748
- 16 = 776868269
- 32 = 449834474
- 64 = 258963324
Why is the number of comparisons not constant? Interestingly, the number of comparisons roughly halves each time the number of threads doubles. Is there some sort of race condition going on for (*comparisons)++, where OMP just skips the increment if the variable is in use?
My current understanding is that the iterations of the k loop are split near-evenly amongst the threads. Each iteration has a private integer j as well as a private copy of the integer k, and a non-parallel for loop that adds to the comparisons counter until it terminates.
Answer 1:
You said it yourself: (*comparisons)++ has a race condition. It is a critical section that has to be serialized ((*pointer)++ is not an atomic operation).
So basically two threads read the same value (e.g., 2), both increment it (to 3), and both write it back, so you end up with 3 instead of 4. You have to make sure that operations on variables that are not local to your parallelized function/loop don't overlap.
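To see this lost-update effect in isolation, here is a minimal sketch (not from the original post; the iteration count and thread number are arbitrary) where several threads increment a shared counter through a pointer with no synchronization:

#include <stdio.h>

int main(void)
{
    long counter = 0;
    long *p = &counter;

    /* Unsynchronized read-modify-write: each (*p)++ is a separate
     * load, add and store, so stores from different threads can
     * overwrite each other and increments get lost. */
    #pragma omp parallel for num_threads(8)
    for (int k = 0; k < 10000000; k++)
        (*p)++;

    /* Typically prints far less than 10000000 when compiled with
     * OpenMP enabled (e.g. gcc -fopenmp). */
    printf("counter = %ld (expected 10000000)\n", counter);
    return 0;
}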
Answer 2:
The naive way around the race condition is to declare the operation as an atomic update:

#pragma omp atomic update
(*comparisons)++;
Note that a critical section here is unnecessary and much more expensive. An atomic update can be declared on a primitive binary or unary operation on any l-value expression with scalar type.
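For illustration, here is a sketch of a few forms the construct accepts (the function and variable names are made up for this example, not taken from the question):

void atomic_update_forms(long *counter, double *weights, int idx)
{
    #pragma omp atomic update
    (*counter)++;                       /* unary update through a pointer */

    #pragma omp atomic update
    *counter += 42;                     /* compound assignment */

    #pragma omp atomic update
    weights[idx] = weights[idx] * 0.5;  /* x = x binop expr form */
}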
Yet this is still not optimal, because the value of *comparisons has to be bounced between CPU caches all the time and an expensive locked instruction is executed for every increment. Instead you should use a reduction. For that you need a separate local variable; the pointer won't work here.
int hostMatch(long *comparisons)
{
    int i = -1;
    int lastI = textLength - patternLength;
    long comparisons_tmp = 0;
    /* each thread accumulates into a private copy of comparisons_tmp;
     * the copies are summed when the parallel region ends */
    #pragma omp parallel for reduction(+:comparisons_tmp)
    for (int k = 0; k <= lastI; k++)
    {
        int j;
        for (j = 0; j < patternLength; j++)
        {
            comparisons_tmp++;
            if (textData[k+j] != patternData[j])
            {
                j = patternLength + 1; // break
            }
        }
        if (j == patternLength && k > i)
            i = k;
    }
    *comparisons = comparisons_tmp;
    return i;
}
P.S.: schedule(static, 1) seems like a bad idea, since it will lead to inefficient memory access patterns on textData. Just leave it out and let the implementation do its thing. If a measurement shows that it's not working efficiently, give it a better hint.
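If a measurement does call for tuning, the hint goes on the same pragma. A sketch (the chunk size 4096 is an arbitrary illustration, not a measured recommendation): large static chunks keep each thread on a contiguous stretch of textData, which is exactly what schedule(static, 1) breaks up.

#pragma omp parallel for schedule(static, 4096) reduction(+:comparisons_tmp)
for (int k = 0; k <= lastI; k++)
{
    /* ... loop body unchanged ... */
}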
Source: https://stackoverflow.com/questions/42395568/openmp-why-does-the-number-of-comparisons-decrease