OpenMP nested loop


Is it because the parallel region is only created once, and not n times as in the second version?

Kind of. The construction

#pragma omp parallel
{
}

also means allocating work items to the threads at '{' and returning the threads to the thread pool at '}'. That involves a lot of thread-to-thread communication. Also, by default, waiting threads are put to sleep by the OS, and some time is needed to wake a thread up again.
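As a sketch of the difference (work(i, j) is a hypothetical work item, not from your code): the first form creates the thread team once, while the second forks and joins the team on every iteration of the outer loop.

// One parallel region for the whole nest: the team is created once.
#pragma omp parallel
{
  for (int i = 0; i < n; i++) {
    #pragma omp for
    for (int j = 0; j < n; j++)
      work(i, j);                 // hypothetical work item
  }
}

// Parallel region inside the outer loop: the team is forked/joined n times.
for (int i = 0; i < n; i++) {
  #pragma omp parallel for
  for (int j = 0; j < n; j++)
    work(i, j);
}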

About your middle sample: you can try to limit the outer for's parallelism with...

#pragma omp parallel private(i,k)
{
for(i=0;i<n;i++) // won't be parallelized (every thread runs all i iterations)
{
  #pragma omp for
  for(j=i+1;j<n;j++)  //will be parallelized
  {
    /* doing sth. */
  }
  #pragma omp for
  for(j=i+1;j<n;j++)  //will be parallelized
    for(k = i+1;k<n;k++)
    {
      /* doing sth. */
    }
  // Is there really nothing sequential here? If there is something, use:
  // won't be parallelized
  #pragma omp single
  { //seq part of outer loop
      printf("Progress... %i\n", i); fflush(stdout);
  }

  // Here is the point: every thread ran the outer loop in parallel, but...
  #pragma omp barrier

  //  all outer-loop iterations are synchronized:
  //       thr0   thr1  thr2
  // i      0      0     0
  //     ----   barrier ----
  // i      1      1     1
  //     ----   barrier ----
  // i      2      2     2
  // and so on
}
}

In general, placing the parallelism at the highest (outermost) possible level of a loop nest is better than placing it on the inner loops. If you need sequential execution of some code, use the more advanced pragmas (omp barrier, omp master or omp single) or OpenMP locks (omp_lock_t) for that code. Any of these will be faster than starting an omp parallel region many times.
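A minimal sketch of the omp_lock_t option mentioned above (the function name and the shared counter are mine, just for illustration):

#include <omp.h>

// Hypothetical helper: sum 0..n-1 with the update protected by an OpenMP lock.
long locked_sum(int n)
{
  omp_lock_t lock;
  long shared_sum = 0;
  omp_init_lock(&lock);
  #pragma omp parallel
  {
    #pragma omp for
    for (int i = 0; i < n; i++) {
      omp_set_lock(&lock);     // only one thread at a time touches shared_sum
      shared_sum += i;
      omp_unset_lock(&lock);
    }
  }
  omp_destroy_lock(&lock);
  return shared_sum;
}

For this particular pattern a reduction(+:shared_sum) clause would of course be cheaper; the lock is shown only because omp_locks are mentioned in the text.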

Your full test is very wrong. You counted the time of both parts of the code together and the time of the second part, but never the time of the first section on its own. Also, the second printf measured the time of the first printf.

The first run is very slow because of thread startup time, memory initialization and cache effects. Also, the OpenMP runtime's heuristics may be auto-tuned after several parallel regions.
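If you want to keep those one-time costs out of a measurement, one simple option (my suggestion, not part of the original test) is a throw-away warm-up region before the first timer read:

// Warm-up: create the thread team once so its startup cost
// is not charged to the first measured region.
#pragma omp parallel
{ }

double t0 = omp_get_wtime();
/* ... code under test ... */
double t1 = omp_get_wtime();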

My version of your test:

$ cat test.c
#include <stdio.h>
#include <omp.h>

void test( int n, int j)
{
  int i ;
  double t_a = 0.0, t_b = 0.0, t_c = 0.0 ;
  t_a = omp_get_wtime() ;
  #pragma omp parallel
  {
    for(i=0;i<n;i++) { }
  }
  t_b = omp_get_wtime() ;
  for(i=0;i<n;i++) {
    #pragma omp parallel
    { }
  }
  t_c = omp_get_wtime() ;
  printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_b-t_a)) ;
  printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_c-t_b)) ;
}

int main(void)
{
  int i, n, j=3  ;
  double t_1 = 0.0, t_2 = 0.0, t_3 = 0.0;
  printf( "Input n: " ) ;
  scanf( "%d", &n ) ;
  while( j --> 0 ) {
      t_1 = omp_get_wtime();
      #pragma omp parallel
      {
        for(i=0;i<n;i++) { }
      }

      t_2 = omp_get_wtime();

      for(i=0;i<n;i++) {
        #pragma omp parallel
        { }
      }
      t_3 = omp_get_wtime();
      printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_2-t_1)) ;
      printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_3-t_2)) ;
      test(n,j) ;
  }
  return 0 ;
}

I did 3 runs for every n inside the program itself.

Results:

$ ./test
Input n: 1000
[2] directive outside for-loop: 5.044824
[2] directive inside for-loop: 48.605116
[2] directive outside for-loop: 0.115031
[2] directive inside for-loop: 1.469195
[1] directive outside for-loop: 0.082415
[1] directive inside for-loop: 1.455855
[1] directive outside for-loop: 0.081297
[1] directive inside for-loop: 1.462352
[0] directive outside for-loop: 0.080528
[0] directive inside for-loop: 1.455786
[0] directive outside for-loop: 0.080807
[0] directive inside for-loop: 1.467101

Only the first run of test() is affected. All subsequent results are the same for test() and main().

Better and more stable results come from a run like this (I used gcc 4.6.1 and a static build):

$ OMP_WAIT_POLICY=active GOMP_CPU_AFFINITY=0-15 OMP_NUM_THREADS=2  ./test
Input n: 5000
[2] directive outside for-loop: 0.079412
[2] directive inside for-loop: 4.266087
[2] directive outside for-loop: 0.031708
[2] directive inside for-loop: 4.319727
[1] directive outside for-loop: 0.047563
[1] directive inside for-loop: 4.290812
[1] directive outside for-loop: 0.033733
[1] directive inside for-loop: 4.324406
[0] directive outside for-loop: 0.047004
[0] directive inside for-loop: 4.273143
[0] directive outside for-loop: 0.092331
[0] directive inside for-loop: 4.279219

Here I set two OpenMP performance environment variables and limited the number of threads to 2.

Also, your "parallelized" loop is wrong (and I reproduced this error in my variant above). The i variable is shared here:

      #pragma omp parallel
      {
        for(i=0;i<n;i++) { }
      }

You should have it as

      #pragma omp parallel
      {
        for(int local_i=0;local_i<n;local_i++) { }
      }
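Equivalently (my note, not from the original answer), you can keep the outer declaration of i and privatize it, or let OpenMP distribute the iterations as well; the loop variable of a parallel for is private automatically:

      // Each thread gets its own copy of i:
      #pragma omp parallel private(i)
      {
        for(i=0;i<n;i++) { }
      }

      // Or share the iterations out across the team:
      #pragma omp parallel for
      for(i=0;i<n;i++) { }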

UPDATE7: Your results for n=1000 are:

[2] directive inside for-loop: 0.001188 
[1] directive outside for-loop: 0.021092
[1] directive inside for-loop: 0.001327 
[1] directive outside for-loop: 0.005238
[1] directive inside for-loop: 0.001048 
[0] directive outside for-loop: 0.020812
[0] directive inside for-loop: 0.001188 
[0] directive outside for-loop: 0.005029
[0] directive inside for-loop: 0.001257 

The 0.001 or 0.02 output of your code is the number of seconds multiplied by 1000, so it is in milliseconds (ms). That means the intervals are around 1 microsecond or 20 microseconds. The granularity of some system clocks (the user time or system time fields of the time utility) is 1 millisecond, 3 ms or 10 ms, and 1 microsecond is only 2000-3000 CPU ticks (on a 2-3 GHz CPU). So you can't measure such short time intervals without a special setup. You should:

  1. Disable CPU power saving (Intel SpeedStep, AMD Cool'n'Quiet), which can put the CPU into a lower-power state by lowering its clock frequency;
  2. Disable dynamic overclocking of the CPU (Intel Turbo Boost);
  3. Measure time without help from the OS, e.g. by reading the TSC (the rdtsc asm instruction; see the sketch after this list);
  4. Disable instruction reordering on out-of-order CPUs (only Atom is not an OOO CPU in the current generation) by adding a cpuid instruction (or another serializing instruction) before and after rdtsc;
  5. Do the run on a completely idle system (0% CPU load on both CPUs before you start the test);
  6. Rewrite the test in a non-interactive way (don't wait for user input with scanf; pass n via argv[1]);
  7. Don't use an X server or a slow terminal to output the results;
  8. Lower the number of interrupts (physically turn off the network; don't play a film in the background; don't touch the mouse and keyboard);
  9. Do a lot of runs (I don't mean restarting the program, but re-running the measured part of the program; j=100 in my program) and add statistical processing of the results;
  10. Don't call printf so often (between measurements); it will pollute the caches and the TLB. Store results internally and output them after all measurements are done.
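As a sketch of points 3 and 4 (my illustration for x86-64 with GCC/Clang inline asm, not from the original answer):

#include <stdint.h>

// Read the time-stamp counter with cpuid in front as a serializing
// instruction, so earlier work cannot be reordered past the read.
static inline uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"   // cpuid leaf 0
        "cpuid\n\t"               // serialize
        "rdtsc\n\t"               // TSC -> edx:eax
        : "=a"(lo), "=d"(hi)
        :
        : "%rbx", "%rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

Newer compilers also provide __rdtsc() (x86intrin.h), and rdtscp waits for earlier instructions to finish, which is often good enough.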

UPDATE8: By statistical processing I mean: take several values, 7 or more. Discard the first value (or even the first 2-3 values if you measured a large number of values). Sort them. Discard roughly 10-20% of the highest and lowest results. Calculate the mean of what remains. Literally:

double results[100], sum = 0.0, mean = 0.0;
int it, count = 0;
// sort results[5]..results[99] here (the first few runs are discarded)
for(it = 20; it < 85; it++) {  // skip the lowest and highest results
  count++; sum += results[it];
}
mean = sum/count;
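Put together as a self-contained sketch (my code; the array size and the 20/85 cut-offs are just the ones from the snippet above):

#include <stdio.h>
#include <stdlib.h>

#define RUNS 100

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

// Trimmed mean: sort the measurements, then average only the middle part.
static double trimmed_mean(double *results, int n, int lo, int hi)
{
    double sum = 0.0;
    int it, count = 0;
    qsort(results, n, sizeof(double), cmp_double);
    for (it = lo; it < hi; it++) {
        sum += results[it];
        count++;
    }
    return sum / count;
}

int main(void)
{
    double results[RUNS];
    int i;
    for (i = 0; i < RUNS; i++)          // stand-in for real measurements
        results[i] = 1.0 + (i % 7) * 0.01;
    printf("mean = %f\n", trimmed_mean(results, RUNS, 20, 85));
    return 0;
}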