Branch divergence, CUDA and Kinetic Monte Carlo

Submitted by 笑着哭i on 2020-01-12 09:57:12

Question


So, I have code that uses Kinetic Monte Carlo (KMC) on a lattice in order to simulate something. I am using CUDA to run this code on my GPU (although I believe the same question applies to OpenCL as well).

This means that I divide my lattice into little sub-lattices and each thread operates on one of them. Since I am doing KMC, each thread runs this code:

   while (condition == true) {
       /* Grab a sample u from U[0,1] */
       for (i = 0; i < 100; i++) {
           /* Do some stuff here to generate A */
           if (A > u) {
               /* Do more stuff here, which could include updates to global memory */
               break;
           }
       }
   }

A is different for different threads, and so is u; 100 is just an arbitrary number. In the real code it could be 1000 or even 10000.

So, won't we have branch divergence when the time comes for a thread to pass through that if? How badly can this affect performance? I know that the answer depends on the code inside the if-clause but how will this scale as I add more and more threads?

Any reference on how I can estimate losses/gains in performance would also be welcome.

Thanks!


Answer 1:


The GPU runs threads in groups of 32 threads, called warps. Divergence can only happen within a warp. So, if you are able to arrange your threads in such a way that the if condition evaluates the same way in the entire warp, there is no divergence.

When there is divergence in an if, conceptually the GPU keeps the whole warp marching through the if body but masks off the threads for which the condition was false: their results and memory requests are simply ignored.

So, say that the if evaluates to true for 10 of the threads in a particular warp. While inside that if, the warp's potential compute performance is reduced from 100% to 10/32 ≈ 31%, as the 22 threads that got disabled by the if could have been doing work but are now just taking up room in the warp.

Once exiting the if, the disabled threads are enabled again, and the warp proceeds with a 100% potential compute performance.

An if-else behaves in much the same way. When the warp gets to the else, the threads that were enabled in the if become disabled, and the ones that were disabled become enabled.

In a for loop that loops a different number of times for each thread in the warp, threads are disabled as their iteration counts reach their set numbers, but the warp as a whole must keep looping until the thread with the highest iteration count is done.

When looking at potential memory throughput, things are a little more complicated. If an algorithm is memory bound, there might not be much or any performance lost to warp divergence, because the number of memory transactions may be reduced. If each thread in the warp were reading from an entirely different location in global memory (a bad situation for a GPU), time would be saved for each disabled thread, as its memory transactions would not have to be performed. On the other hand, if the threads were reading from an array laid out for coalesced access, multiple threads would share the results of a single transaction. In that case, the values meant for the disabled threads are read from memory and then discarded, together with the computations those threads could have done.

So, now you probably have enough of an overview to be able to make pretty good judgement calls as to how much warp divergence is going to affect your performance. The worst case is when only a single thread in a warp is active: you get 1/32 = 3.125% of the potential compute-bound performance. The best case is 31/32 = 96.875%. For an if whose outcome is fully random per thread, you get 50% on average. And as mentioned, memory-bound performance depends on the change in the number of required memory transactions.



Source: https://stackoverflow.com/questions/10980593/branch-divergence-cuda-and-kinetic-monte-carlo
