gpu-warp

Is CUDA warp scheduling deterministic?

Posted by 徘徊边缘 on 2020-01-12 05:38:26
Question: I am wondering whether the warp scheduling order of a CUDA application is deterministic. Specifically, will the ordering of warp execution stay the same across multiple runs of the same kernel with the same input data on the same device? If not, is there anything that could force the ordering of warp execution (say, when debugging an order-dependent algorithm)? Answer 1: The precise behavior of CUDA warp scheduling is not defined. Therefore you cannot depend on it being
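A minimal sketch of how one might observe this in practice: each warp's first lane grabs a ticket from a global counter, so the recorded mapping reflects whatever order the warps happened to be scheduled in on that run. The kernel name and the order/ticket buffers are illustrative, not from the original question.

```cuda
#include <cstdio>

__global__ void record_warp_order(int *order, int *ticket)
{
    int warp_in_grid = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    if ((threadIdx.x % warpSize) == 0) {
        // Lane 0 of each warp takes the next ticket; the warp -> ticket
        // mapping typically differs from run to run.
        order[warp_in_grid] = atomicAdd(ticket, 1);
    }
}

int main()
{
    const int blocks = 8, threads = 256;
    const int warps = blocks * threads / 32;
    int *order, *ticket;
    cudaMallocManaged(&order, warps * sizeof(int));
    cudaMallocManaged(&ticket, sizeof(int));
    *ticket = 0;
    record_warp_order<<<blocks, threads>>>(order, ticket);
    cudaDeviceSynchronize();
    for (int i = 0; i < warps; ++i)
        printf("warp %d was scheduled %d-th\n", i, order[i]);
    return 0;
}
```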

How does a GPU group threads into warps/wavefronts?

Posted by 巧了我就是萌 on 2020-01-04 09:38:27
Question: My understanding is that a warp is a group of threads defined at runtime through the task scheduler. One performance-critical aspect of CUDA is the divergence of threads within a warp. Is there a way to make a good guess about how the hardware will construct warps within a thread block? For instance, if I start a kernel with 1024 threads in a thread block, how are the warps arranged? Can I tell that (or at least make a good guess) from the thread index? Since by doing this, one can minimize
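Per the thread-numbering convention in the CUDA Programming Guide, threads are grouped into warps by consecutive linearized thread index within the block, with threadIdx.x varying fastest. A small demo kernel (the name is illustrative) that makes the warp/lane assignment visible:

```cuda
#include <cstdio>

__global__ void warp_layout_demo()
{
    // Linear index within the block: threadIdx.x varies fastest,
    // then threadIdx.y, then threadIdx.z.
    int linear_tid = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;
    int warp_id = linear_tid / warpSize;   // which warp this thread lands in
    int lane_id = linear_tid % warpSize;   // position within that warp
    if (lane_id == 0)
        printf("block %d: warp %d starts at linear thread %d\n",
               blockIdx.x, warp_id, linear_tid);
}
```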

Some intrinsics named with `_sync()` appended in CUDA 9; semantics same?

Posted by 天涯浪子 on 2020-01-01 14:44:07
Question: In CUDA 9, nVIDIA seems to have this new notion of "cooperative groups"; and for some reason not entirely clear to me, __ballot() is now (as of CUDA 9) deprecated in favor of __ballot_sync(). Is that an alias, or have the semantics changed? ... similar question for other builtins which now have _sync() added to their names. Answer 1: No, the semantics are not the same. The function calls themselves are different, one is not an alias for another, new functionality has been exposed, and the
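A short sketch contrasting the two forms, assuming all 32 lanes of the warp are active (hence the full 0xFFFFFFFF mask); kernel and buffer names are illustrative:

```cuda
__global__ void vote_demo(const int *in, unsigned *out)
{
    int lane  = threadIdx.x & 31;
    bool pred = (in[blockIdx.x * blockDim.x + threadIdx.x] > 0);

    // Legacy (pre-CUDA 9) intrinsic: implicitly assumed the whole warp
    // reached this point together.
    // unsigned votes = __ballot(pred);

    // CUDA 9+ intrinsic: the caller names the participating lanes with an
    // explicit mask, and those lanes are synchronized before the vote.
    unsigned votes = __ballot_sync(0xFFFFFFFFu, pred);

    if (lane == 0)
        out[(blockIdx.x * blockDim.x + threadIdx.x) / 32] = votes;
}
```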

Thread/warp local lock in cuda

Posted by 跟風遠走 on 2019-12-24 09:48:13
Question: I want to implement critical sections in CUDA. I have read many questions and answers on this subject, and the answers often involve atomicCAS and atomicExch. However, this doesn't work at warp level, since all threads in the warp acquire the same lock after the atomicCAS, leading to a deadlock. I think there is a way to have a real lock in CUDA by using the warp __ballot or __any instructions. However, after many attempts, I haven't gotten to a satisfying (read: working) solution. Does anyone here have a good
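One common way around the intra-warp deadlock is to elect a single leader lane to take the lock on behalf of the whole warp. A minimal sketch of that idea, not a drop-in solution: the mutex variable and the placeholder work are assumptions, and it targets CUDA 9+ for __activemask/__syncwarp.

```cuda
__device__ int mutex = 0;   // hypothetical global lock word

__global__ void warp_lock_demo(int *counter)
{
    unsigned active = __activemask();
    int leader = __ffs(active) - 1;       // lowest-numbered active lane
    int lane   = threadIdx.x & 31;

    if (lane == leader) {
        // Only the leader spins on the lock, so the warp cannot deadlock
        // against itself the way 32 competing atomicCAS calls would.
        while (atomicCAS(&mutex, 0, 1) != 0) { }
    }
    __syncwarp(active);                   // other lanes wait for the leader

    // --- critical section (placeholder work) ---
    if (lane == leader)
        *counter += 1;
    // --------------------------------------------

    __threadfence();                      // publish writes before unlocking
    __syncwarp(active);
    if (lane == leader)
        atomicExch(&mutex, 0);            // release the lock
}
```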

Do the threads in a CUDA warp execute in parallel on a multiprocessor?

Posted by 时光毁灭记忆、已成空白 on 2019-12-20 03:26:08
Question: A warp is 32 threads. Do the 32 threads execute in parallel on a multiprocessor? If the 32 threads are not executing in parallel, then there is no race condition within the warp. I got this doubt after going through some examples. Answer 1: In the CUDA programming model, all the threads within a warp run in parallel. But the actual execution in hardware may not be parallel, because the number of cores within an SM (Streaming Multiprocessor) can be less than 32. For example, the GT200 architecture has 8 cores
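A small illustration (kernel name assumed, single 32-thread block for brevity) of why correctness should not hinge on whether the 32 lanes physically run at the same instant: on architectures with independent thread scheduling (Volta and later), even intra-warp shared-memory exchanges need explicit synchronization.

```cuda
__global__ void swap_with_neighbor(int *data)   // launch with 32 threads
{
    __shared__ int s[32];
    int lane = threadIdx.x & 31;

    s[lane] = data[threadIdx.x];
    __syncwarp();                     // do not assume lockstep execution
    int from_neighbor = s[lane ^ 1];  // read the adjacent lane's value
    __syncwarp();                     // everyone has read before s is reused

    data[threadIdx.x] = from_neighbor;
}
```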

How do nVIDIA CC 2.1 GPU warp schedulers issue 2 instructions at a time for a warp?

Posted by 柔情痞子 on 2019-12-18 10:16:57
Question: Note: This question is specific to nVIDIA Compute Capability 2.1 devices. The following information is obtained from the CUDA Programming Guide v4.1: In compute capability 2.1 devices, each SM has 48 SPs (cores) for integer and floating-point operations. Each warp is composed of 32 consecutive threads. Each SM has 2 warp schedulers. At every instruction issue time, one warp scheduler picks a ready warp of threads and issues 2 instructions for the warp on the cores. My doubts: One thread will
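The dual issue described above relies on instruction-level parallelism within each thread: the scheduler can only co-issue two instructions from the same warp if they are independent. A hedged illustration (kernel and variable names are made up):

```cuda
__global__ void ilp_demo(const float *x, float *y, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // These two multiply-adds have no data dependence on each other, so a
    // CC 2.1 scheduler could issue them together for the same warp.
    float p = a * x[2 * i]     + b;
    float q = a * x[2 * i + 1] + b;

    // This add depends on both results, so it cannot be co-issued with them.
    y[i] = p + q;
}
```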

Questions of resident warps of CUDA

Posted by  ̄綄美尐妖づ on 2019-12-14 02:27:46
Question: I have been using CUDA for a month, and now I'm trying to work out how many warps/blocks are needed to hide the latency of memory accesses. I think it is related to the maximum number of resident warps on a multiprocessor. According to Table 13 in the CUDA C Programming Guide (v7.5), the maximum number of resident warps per multiprocessor is 64. Then my question is: what is a resident warp? Does it refer to those warps whose data has been read from GPU memory and which are ready to be processed by the SPs? Or
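One way to see how many warps of a particular kernel can actually be resident on an SM is the occupancy API. A sketch, assuming a hypothetical my_kernel and a block size of 256:

```cuda
#include <cstdio>

__global__ void my_kernel(float *data) { /* ... */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int block_size = 256;
    int max_blocks_per_sm = 0;
    // Resident blocks of my_kernel that fit on one SM at this block size,
    // given its register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, my_kernel, block_size, 0);

    int resident_warps = max_blocks_per_sm * block_size / prop.warpSize;
    printf("resident warps per SM: %d (hardware limit: %d)\n",
           resident_warps, prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```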

When should I use CUDA's built-in warpSize, as opposed to my own proper constant?

Posted by 放肆的年华 on 2019-12-13 08:56:38
Question: nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5) So, at least for that purpose, you are motivated to have something like (edit): enum : unsigned int { warp_size = 32 }; somewhere in your headers. But now - which
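To make the distinction concrete, a small sketch (the enum follows the question's own definition; the commented-out line reflects the compile error described above):

```cuda
enum : unsigned int { warp_size = 32 };    // compile-time constant from the question

__global__ void constant_vs_builtin()
{
    // __shared__ int bad[warpSize];       // rejected: warpSize is not a
                                           // compile-time constant expression
    __shared__ int good[warp_size];        // fine: warp_size is a true constant

    int lane = threadIdx.x % warpSize;     // runtime uses of warpSize are fine
    good[lane] = lane;
}
```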

Removing __syncthreads() in CUDA warp-level reduction

Posted by 血红的双手。 on 2019-12-05 00:30:40
Question: The following code sums every 32 elements in an array into the very first element of each 32-element group:

    int i = threadIdx.x;
    int warpid = i & 31;
    if (warpid < 16) {
        s_buf[i] += s_buf[i+16]; __syncthreads();
        s_buf[i] += s_buf[i+8];  __syncthreads();
        s_buf[i] += s_buf[i+4];  __syncthreads();
        s_buf[i] += s_buf[i+2];  __syncthreads();
        s_buf[i] += s_buf[i+1];  __syncthreads();
    }

I thought I could eliminate all the __syncthreads() in the code, since all the operations are done within the same warp. But if I
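For reference, a warp-synchronous reduction that avoids both __syncthreads() and the shared-memory hazards is to keep the partial sums in registers with shuffle intrinsics (CUDA 9+). A sketch with illustrative names, not the questioner's original code:

```cuda
__device__ int warp_reduce_sum(int val)
{
    // Each step folds the upper half of the live lanes onto the lower half;
    // after five steps, lane 0 holds the sum of all 32 lanes.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xFFFFFFFFu, val, offset);
    return val;
}

__global__ void reduce_per_warp(const int *in, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = warp_reduce_sum(in[i]);
    if ((threadIdx.x & 31) == 0)
        out[i / 32] = sum;   // one result per 32-element group
}
```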