OpenMP nested loop


Is it because the parallel region is only created once, and not n times as in the second version?

Kind of. The construction

#pragma omp parallel
{
}

also means allocating work items to the threads at '{' and returning the threads to the thread pool at '}'. That involves a lot of thread-to-thread communication. Also, by default, waiting threads are put to sleep by the OS, and some time is needed to wake a thread up again.
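As a sketch of the difference (work(i, j) is a hypothetical work item, not from your code): the first form creates the thread team once, while the second forks and joins the team on every iteration of the outer loop.

// One parallel region for the whole nest: the team is created once.
#pragma omp parallel
{
  for (int i = 0; i < n; i++) {
    #pragma omp for
    for (int j = 0; j < n; j++)
      work(i, j);                 // hypothetical work item
  }
}

// Parallel region inside the outer loop: the team is forked/joined n times.
for (int i = 0; i < n; i++) {
  #pragma omp parallel for
  for (int j = 0; j < n; j++)
    work(i, j);
}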

About your middle sample: you can try to limit the outer for's parallelism with...

#pragma omp parallel private(i,k)
{
for(i=0;i<n;i++) // won't be parallelized (every thread runs all i iterations)
{
  #pragma omp for
  for(j=i+1;j<n;j++)  //will be parallelized
  {
    /* doing sth. */
  }
  #pragma omp for
  for(j=i+1;j<n;j++)  //will be parallelized
    for(k = i+1;k<n;k++)
    {
      /* doing sth. */
    }
  // Is there really nothing sequential here? If there is something, use:
  // won't be parallelized
  #pragma omp single
  { //seq part of outer loop
      printf("Progress... %i\n", i); fflush(stdout);
  }

  // Here is the point: every thread ran the outer loop in parallel, but...
  #pragma omp barrier

  //  all outer-loop iterations are synchronized:
  //       thr0   thr1  thr2
  // i      0      0     0
  //     ----   barrier ----
  // i      1      1     1
  //     ----   barrier ----
  // i      2      2     2
  // and so on
}
}

In general, placing the parallelism at the highest (outermost) possible level of a loop nest is better than placing it on the inner loops. If you need sequential execution of some code, use the more advanced pragmas (omp barrier, omp master or omp single) or OpenMP locks (omp_lock_t) for that code. Any of these will be faster than starting an omp parallel region many times.
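A minimal sketch of the omp_lock_t option mentioned above (the function name and the shared counter are mine, just for illustration):

#include <omp.h>

// Hypothetical helper: sum 0..n-1 with the update protected by an OpenMP lock.
long locked_sum(int n)
{
  omp_lock_t lock;
  long shared_sum = 0;
  omp_init_lock(&lock);
  #pragma omp parallel
  {
    #pragma omp for
    for (int i = 0; i < n; i++) {
      omp_set_lock(&lock);     // only one thread at a time touches shared_sum
      shared_sum += i;
      omp_unset_lock(&lock);
    }
  }
  omp_destroy_lock(&lock);
  return shared_sum;
}

For this particular pattern a reduction(+:shared_sum) clause would of course be cheaper; the lock is shown only because omp_locks are mentioned in the text.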

Your full test is very wrong. You counted the time of both parts of the code together and the time of the second part, but never the time of the first section on its own. Also, the second printf measured the time of the first printf.

The first run is very slow because of thread startup time, memory initialization and cache effects. Also, the OpenMP runtime's heuristics may be auto-tuned after several parallel regions.
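If you want to keep those one-time costs out of a measurement, one simple option (my suggestion, not part of the original test) is a throw-away warm-up region before the first timer read:

// Warm-up: create the thread team once so its startup cost
// is not charged to the first measured region.
#pragma omp parallel
{ }

double t0 = omp_get_wtime();
/* ... code under test ... */
double t1 = omp_get_wtime();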

My version of your test:

$ cat test.c
#include <stdio.h>
#include <omp.h>

void test( int n, int j)
{
  int i ;
  double t_a = 0.0, t_b = 0.0, t_c = 0.0 ;
  t_a = omp_get_wtime() ;
  #pragma omp parallel
  {
    for(i=0;i<n;i++) { }
  }
  t_b = omp_get_wtime() ;
  for(i=0;i<n;i++) {
    #pragma omp parallel
    { }
  }
  t_c = omp_get_wtime() ;
  printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_b-t_a)) ;
  printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_c-t_b)) ;
}

int main(void)
{
  int i, n, j=3  ;
  double t_1 = 0.0, t_2 = 0.0, t_3 = 0.0;
  printf( "Input n: " ) ;
  scanf( "%d", &n ) ;
  while( j --> 0 ) {
      t_1 = omp_get_wtime();
      #pragma omp parallel
      {
        for(i=0;i<n;i++) { }
      }

      t_2 = omp_get_wtime();

      for(i=0;i<n;i++) {
        #pragma omp parallel
        { }
      }
      t_3 = omp_get_wtime();
      printf( "[%i] directive outside for-loop: %lf\n", j, 1000*(t_2-t_1)) ;
      printf( "[%i] directive inside for-loop: %lf \n", j, 1000*(t_3-t_2)) ;
      test(n,j) ;
  }
  return 0 ;
}

I did 3 runs for every n inside the program itself.

Results:

$ ./test
Input n: 1000
[2] directive outside for-loop: 5.044824
[2] directive inside for-loop: 48.605116
[2] directive outside for-loop: 0.115031
[2] directive inside for-loop: 1.469195
[1] directive outside for-loop: 0.082415
[1] directive inside for-loop: 1.455855
[1] directive outside for-loop: 0.081297
[1] directive inside for-loop: 1.462352
[0] directive outside for-loop: 0.080528
[0] directive inside for-loop: 1.455786
[0] directive outside for-loop: 0.080807
[0] directive inside for-loop: 1.467101

Only the first run of test() is affected. All subsequent results are the same for test() and main().

Better and more stable results come from a run like this (I used gcc 4.6.1 and a static build):

$ OMP_WAIT_POLICY=active GOMP_CPU_AFFINITY=0-15 OMP_NUM_THREADS=2  ./test
Input n: 5000
[2] directive outside for-loop: 0.079412
[2] directive inside for-loop: 4.266087
[2] directive outside for-loop: 0.031708
[2] directive inside for-loop: 4.319727
[1] directive outside for-loop: 0.047563
[1] directive inside for-loop: 4.290812
[1] directive outside for-loop: 0.033733
[1] directive inside for-loop: 4.324406
[0] directive outside for-loop: 0.047004
[0] directive inside for-loop: 4.273143
[0] directive outside for-loop: 0.092331
[0] directive inside for-loop: 4.279219

Here I set two OpenMP performance environment variables and limited the number of threads to 2.

Also, your "parallelized" loop is wrong (and I reproduced this error in my variant above). The i variable is shared here:

      #pragma omp parallel
      {
        for(i=0;i<n;i++) { }
      }

You should have it as

      #pragma omp parallel
      {
        for(int local_i=0;local_i<n;local_i++) { }
      }
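Equivalently (my note, not from the original answer), you can keep the outer declaration of i and privatize it, or let OpenMP distribute the iterations as well; the loop variable of a parallel for is private automatically:

      // Each thread gets its own copy of i:
      #pragma omp parallel private(i)
      {
        for(i=0;i<n;i++) { }
      }

      // Or share the iterations out across the team:
      #pragma omp parallel for
      for(i=0;i<n;i++) { }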

UPDATE7: Your results for n=1000 are:

[2] directive inside for-loop: 0.001188 
[1] directive outside for-loop: 0.021092
[1] directive inside for-loop: 0.001327 
[1] directive outside for-loop: 0.005238
[1] directive inside for-loop: 0.001048 
[0] directive outside for-loop: 0.020812
[0] directive inside for-loop: 0.001188 
[0] directive outside for-loop: 0.005029
[0] directive inside for-loop: 0.001257 

The 0.001 or 0.02 output of your code is the number of seconds multiplied by 1000, so it is in milliseconds (ms). That means the intervals are around 1 microsecond or 20 microseconds. The granularity of some system clocks (the user time or system time fields of the time utility) is 1 millisecond, 3 ms or 10 ms, and 1 microsecond is only 2000-3000 CPU ticks (on a 2-3 GHz CPU). So you can't measure such short time intervals without a special setup. You should:

  1. Disable CPU power saving (Intel SpeedStep, AMD Cool'n'Quiet), which can put the CPU into a lower-power state by lowering its clock frequency;
  2. Disable dynamic overclocking of the CPU (Intel Turbo Boost);
  3. Measure time without help from the OS, e.g. by reading the TSC (the rdtsc asm instruction; see the sketch after this list);
  4. Disable instruction reordering on out-of-order CPUs (only Atom is not an OOO CPU in the current generation) by adding a cpuid instruction (or another serializing instruction) before and after rdtsc;
  5. Do the run on a completely idle system (0% CPU load on both CPUs before you start the test);
  6. Rewrite the test in a non-interactive way (don't wait for user input with scanf; pass n via argv[1]);
  7. Don't use an X server or a slow terminal to output the results;
  8. Lower the number of interrupts (physically turn off the network; don't play a film in the background; don't touch the mouse and keyboard);
  9. Do a lot of runs (I don't mean restarting the program, but re-running the measured part of the program; j=100 in my program) and add statistical processing of the results;
  10. Don't call printf so often (between measurements); it will pollute the caches and the TLB. Store results internally and output them after all measurements are done.
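As a sketch of points 3 and 4 (my illustration for x86-64 with GCC/Clang inline asm, not from the original answer):

#include <stdint.h>

// Read the time-stamp counter with cpuid in front as a serializing
// instruction, so earlier work cannot be reordered past the read.
static inline uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"   // cpuid leaf 0
        "cpuid\n\t"               // serialize
        "rdtsc\n\t"               // TSC -> edx:eax
        : "=a"(lo), "=d"(hi)
        :
        : "%rbx", "%rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

Newer compilers also provide __rdtsc() (x86intrin.h), and rdtscp waits for earlier instructions to finish, which is often good enough.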

UPDATE8: By statistical processing I mean: take several values, 7 or more. Discard the first value (or even the first 2-3 values if you measured a large number of values). Sort them. Discard roughly 10-20% of the highest and lowest results. Calculate the mean of what remains. Literally:

double results[100], sum = 0.0, mean = 0.0;
int it, count = 0;
// sort results[5]..results[99] here (the first few runs are discarded)
for(it = 20; it < 85; it++) {  // skip the lowest and highest results
  count++; sum += results[it];
}
mean = sum/count;
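Put together as a self-contained sketch (my code; the array size and the 20/85 cut-offs are just the ones from the snippet above):

#include <stdio.h>
#include <stdlib.h>

#define RUNS 100

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

// Trimmed mean: sort the measurements, then average only the middle part.
static double trimmed_mean(double *results, int n, int lo, int hi)
{
    double sum = 0.0;
    int it, count = 0;
    qsort(results, n, sizeof(double), cmp_double);
    for (it = lo; it < hi; it++) {
        sum += results[it];
        count++;
    }
    return sum / count;
}

int main(void)
{
    double results[RUNS];
    int i;
    for (i = 0; i < RUNS; i++)          // stand-in for real measurements
        results[i] = 1.0 + (i % 7) * 0.01;
    printf("mean = %f\n", trimmed_mean(results, RUNS, 20, 85));
    return 0;
}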