GPGPU: Consequence of having a common PC in a warp


Question


I read in a book that all threads in a wavefront or warp share a common program counter. What are the consequences of this? Why does it matter?


Answer 1:


NVIDIA GPUs execute 32 threads at a time (warps) and AMD GPUs execute 64 threads at a time (wavefronts). Sharing the control logic, instruction fetch, and data paths across the warp reduces die area and increases perf/area and perf/watt.
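The warp size does not have to be hard-coded; a minimal CUDA sketch (assuming device 0 is present) can query it at runtime:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // query device 0 (assumed present)
    printf("warpSize = %d\n", prop.warpSize);   // 32 on current NVIDIA GPUs
    return 0;
}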

In order to take advantage of the design, programming languages and developers need to understand how to coalesce memory accesses and how to manage control-flow divergence. If each thread in a warp/wavefront takes a different execution path, or if each thread accesses significantly divergent memory, the benefits of the design are lost and performance degrades significantly, as the sketch below illustrates.
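To make both pitfalls concrete, here is a hedged CUDA sketch (the kernel names and the stride parameter are illustrative, not from the answer). In the first kernel the 32 threads of a warp read 32 adjacent floats, which the hardware can coalesce into a few wide transactions; in the second they read addresses scattered by stride, costing up to one transaction per thread:

__global__ void copyCoalesced(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];        // neighbouring lanes touch neighbouring addresses
}

__global__ void copyStrided(const float* in, float* out, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];        // each warp's accesses are spread over many cache lines
}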




Answer 2:


This means that all threads in the warp run the same instructions at the same time. This is very important for ensuring that all threads have completed the previous line when processing the current line. For instance, if you need to pass data from one thread to another, you need to be sure the data has already been written by the first thread. Because the program counter is shared, you know that once the line writing the data completes, the data exists for all threads, as in the sketch below.
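As an illustration, here is a hedged CUDA sketch of a warp-level neighbour exchange that relies on this property (the kernel name is illustrative; note that on Volta and newer GPUs the per-warp lockstep guarantee is relaxed, so an explicit __syncwarp() is needed where older code relied on the shared PC alone):

__global__ void neighbourExchange(float* data) {
    __shared__ float buf[32];                  // assumes a single 32-thread warp per block
    int lane = threadIdx.x % 32;               // lane index within the warp
    buf[lane] = data[threadIdx.x];             // every lane writes its own value
    __syncwarp();                              // implicit on pre-Volta GPUs thanks to the shared PC
    data[threadIdx.x] = buf[(lane + 1) % 32];  // read the neighbour's value, now guaranteed written
}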




Answer 3:


As some of the other answers have stated, the threads (warps/wavefronts) execute in sync with each other. To a developer this means you need to pay special attention to any branching or conditional logic, because if even one work-item in a group hits the 'else' condition, all the other work-items pause while that code executes.
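For example, in this hedged CUDA sketch (the kernel name is illustrative), any warp containing both even and odd lanes executes the two branch bodies one after the other, with the inactive lanes masked off each time:

__global__ void divergent(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) {
        x[i] *= 2.0f;      // even lanes run while odd lanes are masked off
    } else {
        x[i] += 1.0f;      // then odd lanes run while even lanes are masked off
    }
}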

So why would GPU manufacturers want to do this? Omitting per-thread program counters, branch prediction, and large caches saves a lot of silicon that can be spent on more Arithmetic Logic Units (ALUs). More ALUs means more concurrent work-groups or threads.

Related: CPU vs GPU hardware.




Answer 4:


As usual, knowing how things work under the hood helps you increase performance. From the OpenCL developer's point of view, we only know that:

The work-items in a given work-group execute concurrently on the processing elements of a single compute unit. (OpenCL spec 1.2, section 3.2)

This, and the way SIMT architectures work today, leads to considerations like the following when speaking about branches (from this post):

Executing both branches happens only if the condition is not coherent between threads in a local work group, that means if the condition evaluates to different values between work items in a local work group, current generation GPUs will execute both branches, but only the correct branches will write values and have side effects.

This is quite correct, but it doesn't give you any insight into how to avoid divergence (note that we are still at the work-group level here).

But knowing that a work-group is composed of one or more warps within which the work-items share a PC (the sharing is not at the work-group level) can sometimes help you avoid divergence. Divergence occurs only if some work-items within a single warp take different paths (both branches are then executed). Consider this (source):

if (threadIdx.x > 2) {...} else {...}

and this:

if (threadIdx.x / WARP_SIZE > 2) {...} else {...}

In the first case there is divergence within the first warp (of 32 threads on NVIDIA hardware), but not in the second, where the condition is uniform across each warp whatever the size of the work-group. Obviously these two examples do not do the same thing, but in some cases you may be able to rearrange your data (or find another trick) to keep the spirit of the second example.

This may seem removed from real code, but a real-life example is reduction. By ordering your operations in a "SIMD-friendly" structure, you can drop whole warps at each stage (leaving room for warps from other work-groups). See the "Taking Advantage Of Commutativity" section of this whitepaper for the full explanation and code.
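Here is a hedged CUDA sketch of that idea (assuming a block size of 256; the kernel name is illustrative). Because the active threads stay packed in the low lane numbers, entire warps become idle at each stage and can be replaced by warps from other work-groups:

__global__ void reduceSum(const float* in, float* out) {
    __shared__ float s[256];               // assumes blockDim.x == 256
    unsigned t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();
    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)                    // active lanes stay contiguous,
            s[t] += s[t + stride];         // so whole warps retire at each stage
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = s[0];    // one partial sum per block
}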



Source: https://stackoverflow.com/questions/25473593/gpgpu-consequence-of-having-a-common-pc-in-a-warp
