There is a lot of good information in the answers above, but the proper answer is that some optimizations MUST be turned off when compiling OpenMP code. Some compilers, such as gcc, do not do that.
The example program at the end of this answer searches four non-overlapping ranges of integers for a value whose square is 81. It should always find that value. However, with all gcc versions up to at least 4.7.2, the program sometimes does not produce the correct answer. To see for yourself, do this:
Save the program below as parsearch.c, then:

gcc -fopenmp -O2 parsearch.c
OMP_NUM_THREADS=2 ./a.out

Alternatively, compile with -O0 instead of -O2, and see that the outcome is always correct.
Given that the program is free of race conditions, this behaviour of the compiler under -O2 is incorrect.
The behaviour is due to the global variable globFound. Please convince yourself that, under expected execution, only one of the 4 threads in the parallel for writes to that variable. The OpenMP semantics say that if a global (shared) variable is written by only one thread, its value after the parallel for is the value written by that single thread. There is no communication between the threads through the global variable, and such communication would not be allowed anyway, as it gives rise to race conditions.
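To make that rule concrete, here is a distilled sketch (a simplified illustration of the semantics, not taken from the program below):

int flag = 0 ;
int i ;
#pragma omp parallel for shared(flag)
for( i = 0 ; i < 1000 ; i++ ) {
    if( i == 42 ) {        /* exactly one iteration, hence one thread, writes */
        flag = i ;
    }
}
/* after the parallel for, flag == 42: the value written by that single thread */

Because only one thread ever stores to flag, the program has no race condition and needs no synchronization.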
What the compiler optimization does under -O2 is estimate that writing to a global variable inside a loop is expensive, and therefore cache it in a register. This happens in the function findit, which, after optimization, looks roughly like this:
int tempo = globFound ;
for ( ... ) {
    if ( ... ) {
        tempo = i ;
    }
}
globFound = tempo ;
But with this 'optimized' code, every thread reads and writes globFound, and a race condition has been introduced by the compiler itself.
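If you are stuck with a compiler that behaves this way, one defensive workaround (my own suggestion, not part of the original program) is to put the rare write inside a critical construct. The flushes implied on entry to and exit from the critical region should keep the compiler from caching globFound in a register, at the cost of some overhead on the one iteration that actually writes:

void findit( int from, int to )
{
    int i ;

    for( i = from ; i < to ; i++ ) {
        if( (long)i * i == 81L ) {
            /* critical implies a flush on entry and on exit, which should
               prevent the compiler from caching the store in a register */
            #pragma omp critical
            globFound = i ;
        }
    }
}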
Compiler optimizations do need to be aware of parallel execution. Excellent material about this has been published by Hans-J. Boehm under the general topic of memory consistency.
#include <stdio.h>

#define BIGVAL  (100 * 1000 * 1000)

int globFound ;

void findit( int from, int to )
{
    int i ;

    for( i = from ; i < to ; i++ ) {
        /* the cast to long avoids signed integer overflow for large i */
        if( (long)i * i == 81L ) {
            globFound = i ;
        }
    }
}

int main( int argc, char *argv[] )
{
    int p ;

    globFound = -1 ;

    #pragma omp parallel for
    for( p = 0 ; p < 4 ; p++ ) {
        findit( p * BIGVAL, (p+1) * BIGVAL ) ;
    }

    if( globFound == -1 ) {
        printf( ">>>>NO 81 TODAY<<<<\n\n" ) ;
    } else {
        printf( "Found! N = %d\n\n", globFound ) ;
    }
    return 0 ;
}