Why does my parallel code using OpenMP atomic take longer than the serial code?


Question


A snippet of my serial code is shown below.

 Program main
  use omp_lib
  Implicit None
   
  Integer :: i, my_id
  Real(8) :: t0, t1, t2, t3, a = 0.0d0

  !$ t0 = omp_get_wtime()
  Call CPU_time(t2)
  ! ------------------------------------------ !

    Do i = 1, 100000000
      a = a + Real(i)
    End Do

  ! ------------------------------------------ !
  Call CPU_time(t3)
  !$ t1 = omp_get_wtime()
  ! ------------------------------------------ !

  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
  Write (*,*) "The CPU time is ", t3-t2, "s"
End Program main

The elapsed time:

Using the OpenMP do and atomic directives, I converted the serial code into parallel code. However, the parallel program is slower than the serial one, and I don't understand why. Here is my parallel code snippet:

Program main
  use omp_lib
  Implicit None
    
  Integer, Parameter :: n_threads = 8
  Integer :: i, my_id
  Real(8) :: t0, t1, t2, t3, a = 0.0d0
 
  !$ t0 = omp_get_wtime()
  Call CPU_time(t2)
  ! ------------------------------------------ !

  !$OMP Parallel Num_threads(n_threads) shared(a)
  
   !$OMP Do 
     Do i = 1, 100000000
       !$OMP Atomic
       a = a + Real(i)
     End Do
   !$OMP End Do
  
  !$OMP End Parallel
  
  ! ------------------------------------------ !
  Call CPU_time(t3)
  !$ t1 = omp_get_wtime()
  ! ------------------------------------------ !

  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
  Write (*,*) "The CPU time is ", t3-t2, "s"
End Program main

The elapsed time:

So my question is: why does my parallel code using OpenMP atomic take longer than the serial code?


Answer 1:


You are applying an atomic operation to the same variable in every single loop iteration. Moreover, that variable carries a dependency across those iterations. Naturally, that comes with additional overheads (e.g., synchronization, serialization of the updates, and extra CPU cycles) compared with the sequential version. Furthermore, you are probably getting a lot of cache misses because the threads keep invalidating each other's copy of the cache line holding that variable.

This is the typical code that should use a reduction of the variable a (i.e., !$omp parallel do reduction(+:a)) instead of an atomic operation. With the reduction, each thread gets a private copy of the variable 'a', and at the end of the parallel region the threads combine their copies (using the '+' operator) into a single value that is propagated to the variable 'a' of the main thread. A sketch of your loop rewritten this way is shown below.
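As an illustration, here is a minimal sketch of the loop from the question rewritten with the reduction clause suggested above (the thread count of 8 is kept from the original; the CPU_time calls are omitted for brevity):

Program main
  use omp_lib
  Implicit None

  Integer, Parameter :: n_threads = 8
  Integer :: i
  Real(8) :: t0, t1, a = 0.0d0

  !$ t0 = omp_get_wtime()

  ! Each thread accumulates into its own private copy of 'a';
  ! the copies are summed once at the end of the parallel region,
  ! so there is no per-iteration synchronization.
  !$OMP Parallel Do Num_threads(n_threads) Reduction(+:a)
  Do i = 1, 100000000
    a = a + Real(i)
  End Do
  !$OMP End Parallel Do

  !$ t1 = omp_get_wtime()

  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
End Program main

With this form the expensive synchronization happens only once per thread rather than once per iteration, so the loop should scale much better with the number of threads.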

You can find a more detailed answer about the differences between atomic and reduction in this SO thread. That thread even includes an example whose atomic version is, just like yours, far slower than its sequential counterpart (about 20x slower), so that case is even worse than yours (20x vs. 10x).



Source: https://stackoverflow.com/questions/64823158/why-my-parallel-code-using-openmp-atomic-takes-a-longer-time-than-serial-code
