C++ OpenMP Fibonacci: 1 thread performs much faster than 4 threads

时光总嘲笑我的痴心妄想 submitted on 2019-12-08 13:37:02

Question


I'm trying to understand why the following runs much faster on 1 thread than on 4 threads with OpenMP. The code is based on a similar question, OpenMP recursive tasks, but when I try to implement one of the suggested answers I don't get the intended speedup, which suggests I've done something wrong (I'm not sure what). Do people see better speed running the code below on 4 threads than on 1? I'm seeing a 10x slowdown on 4 cores, where I would expect at least a moderate speedup rather than a significant slowdown.

#include <iostream>
#include <omp.h>
using std::cout;
using std::endl;

int fib(int n)
{
    if (n == 0 || n == 1)
        return n;
    if (n < 20) // EDITED CODE TO INCLUDE CUTOFF: recurse serially below this size
        return fib(n-1) + fib(n-2);
    int res, a, b;
    #pragma omp task shared(a)
    a = fib(n-1);
    #pragma omp task shared(b)
    b = fib(n-2);
    #pragma omp taskwait
    res = a + b;
    return res;
}

int main(){
  omp_set_nested(1);
  omp_set_num_threads(4);
  double start_time = omp_get_wtime();
  #pragma omp parallel
  {
    #pragma omp single
    {
      cout << fib(25) << endl;
    }
  }
  double time = omp_get_wtime() - start_time;
  std::cout << "Time(ms): " << time*1000 << std::endl;
  return 0;
}

Answer 1:


Have you tried it with a larger number?

In multi-threading, it takes some time to initialize work on the CPU cores. For small jobs that finish very quickly on a single core anyway, this startup overhead makes threading slower, not faster.

Multi-threading only shows a speedup when the job normally takes longer than a second, not milliseconds.

There is also another bottleneck for threading: if your code creates too many threads or tasks, typically through recursion, the scheduling overhead delays all of the running threads and causes a massive slowdown.

The OpenMP/Tasks wiki page mentions this and suggests a manual cutoff: keep two versions of the function, and once the recursion gets too deep, continue recursing with the single-threaded version.

EDIT: the depth counter needs to be incremented before entering the OMP task region, so the child tasks see the updated depth.


The following code is for the OP to test:

#include <iostream>
#include <omp.h>
using std::cout;
using std::endl;

#define CUTOFF 5

// Plain serial version, used once the recursion is deep enough.
int fib_s(int n)
{
    if (n == 0 || n == 1)
        return n;
    int res, a, b;
    a = fib_s(n - 1);
    b = fib_s(n - 2);
    res = a + b;
    return res;
}

// Task-parallel version; co tracks the recursion depth.
int fib_m(int n, int co)
{
    if (co >= CUTOFF) return fib_s(n);  // too deep: stop creating tasks
    if (n == 0 || n == 1)
        return n;
    int res, a, b;
    co++;  // increment the depth before spawning the child tasks
#pragma omp task shared(a)
    a = fib_m(n - 1, co);
#pragma omp task shared(b)
    b = fib_m(n - 2, co);
#pragma omp taskwait
    res = a + b;
    return res;
}

int main()
{
    omp_set_nested(1);
    omp_set_num_threads(4);
    double start_time = omp_get_wtime();
#pragma omp parallel
    {
#pragma omp single
        {
            cout << fib_m(25,1) << endl;
        }
    }
    double time = omp_get_wtime() - start_time;
    std::cout << "Time(ms): " << time * 1000 << std::endl;
    return 0;
}

RESULT: with the depth value set to 10, the 45th term was computed in under 8 seconds.

co=1   14.5s
co=2    9.5s
co=3    6.4s
co=10   7.5s
co=15   7.0s 
co=20   8.5s
co=21 >18.0s
co=22 >40.0s



Answer 2:


I do not know of a way to tell the compiler not to create parallel tasks below a certain depth: omp_set_max_active_levels seems to have no effect, and omp_set_nested is deprecated (and also has no effect here).

So I have to manually specify the level after which no more tasks are created, which IMHO is a pity. I still believe there should be a built-in way to do this (if somebody knows one, kindly let me know). With this approach, for input sizes above 20 the parallel version runs somewhat faster than the serial one (around 70-80% of the serial time). Ref: the code is adapted from an assignment in this course (no solution was provided, so I don't know how to do it efficiently): https://www.cs.iastate.edu/courses/2018/fall/com-s-527x

#include <stdio.h>
#include <omp.h>
#include <math.h>

int fib(int n, int rec_height)
{
  int x = 1, y = 1;
  if (n < 2)
      return n;

  if (rec_height > 0)   // Surprisingly, without this check the parallel code is slower than the serial one (I believe it should not be needed; I just don't know how to use OpenMP)
  {
    rec_height -= 1;
    #pragma omp task shared(x)
    x = fib(n - 1, rec_height);
    #pragma omp task shared(y)
    y = fib(n - 2, rec_height);
    #pragma omp taskwait
  }
  else {
    x = fib(n - 1, rec_height);
    y = fib(n - 2, rec_height);
  }
  return x + y;
}


int main()
{
  int tot_thread = 16;
  int recDepth = (int)log2f(tot_thread);
  if( ((int)pow(2, recDepth)) < tot_thread) recDepth += 1;
  printf("\nrecDepth: %d\n",recDepth);
  omp_set_max_active_levels(recDepth);
  omp_set_nested(recDepth-1);

  int n,fibonacci;
  double starttime;
  printf("\nPlease insert n, to calculate fib(n): ");
  scanf("%d",&n);
  omp_set_num_threads(tot_thread);
  starttime=omp_get_wtime();
  #pragma omp parallel
  {
   #pragma omp single
   {
    fibonacci=fib(n, recDepth);
   }
  }
  printf("\n\nfib(%d)=%d \n",n,fibonacci);
  printf("calculation took %lf sec\n",omp_get_wtime()-starttime);
  return 0;
}


Source: https://stackoverflow.com/questions/50340903/c-openmp-fibonacci-1-thread-performs-much-faster-than-4-threads
