CUDA thread scheduling - latency hiding

Posted on 2021-02-18 19:06:24

Question


When is a cuda thread (or a whole warp), that performs a read from global memory, put to sleep by the scheduler? Let's say I do some computations in the kernel, right after the memory read, that do not depend on the read data. Can these be executed while the data from the global read isn't there yet?


Answer 1:


A memory read by itself does not cause a stall (barring the cases where the LD/ST unit is unavailable).

The thread stall will occur when the result of that memory read operation needs to be used by another operation.

The compiler is aware of this and will attempt to reorder independent (SASS) instructions so that a read will be followed by independent instructions.

However, once code is compiled, the instruction sequence is not altered (CUDA GPUs currently do not perform speculative execution or out-of-order execution). So once the operation that depends on the read occurs in the (SASS) instruction stream, that thread will stall until the read operation is complete. (1)

Therefore if you did something like this:

float a = global_data[idx];  // line 1: load issued; no stall here
float b = c*d;               // line 2: independent arithmetic, overlaps the load
a = a*b;                     // line 3: first use of a -- stall point if the load is not done

Then line 1 of the above code will not cause a thread stall. Line 2 will not cause a stall assuming c and d are ready/available. Line 3 will cause a stall if the value of a has not been retrieved from global memory by the time that line is encountered. (Since it also depends on b, there will probably be some arithmetic latency -- possibly a stall -- while b is passing through the multiply pipe, but this arithmetic latency may be much shorter than global memory latency.)

As already mentioned, even if you don't write code this way, the compiler will generally attempt to re-order independent operations such that the situation is more favorable. For example if you wrote the code this way:

float b = c*d;
float a = global_data[idx];
a = a*b;

it's quite possible the underlying SASS code might not be significantly different. Even if you do something like this:

float b = c*d;
float a = global_data[idx]*b;

the compiler will break the second line of code into (at least) two separate operations: the load of global_data[idx] into a register, followed by a multiply operation. Again, the underlying SASS code in any of these realizations may not be substantially different.
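Putting the pieces above together, here is a minimal sketch of the pattern in a complete kernel. The kernel name, the `out` output array, and the bounds-check parameter `n` are my assumptions, not from the original question; the point is only to show the load issued early, independent work overlapping the in-flight load, and the dependent use where the stall can occur.

```cuda
// Hypothetical kernel illustrating the ordering discussed above.
// global_data, c, d follow the snippets in the answer; out and n are assumed.
__global__ void latency_hiding_sketch(const float *global_data, float *out,
                                      float c, float d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float a = global_data[idx];  // load issued; the thread does not stall yet
    float b = c * d;             // independent math executes while the load is in flight
    out[idx] = a * b;            // first use of a: the warp stalls here if the
                                 // global load has not yet completed
}
```

Whether you write the load first or the multiply first, the compiler is free to emit essentially this SASS ordering; the sketch just makes the dependence structure explicit. In practice the GPU also hides the remaining latency by switching to other resident warps, so per-thread instruction-level parallelism like this is one contributor among several.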

(1) Fermi cc2.1 and cc3.x and higher SMs generally have the capability for multiple issue, i.e. superscalar operation. This means that multiple (independent) SASS instructions from the same instruction stream, for the same warp, can be scheduled in the same cycle, subject to resource limits and restrictions. I don't consider such multiple-issue cases to contradict the statements about speculative or OOO execution, and I don't consider them to materially impact the discussion above. Once a thread has stalled, i.e. the opportunity to issue instructions within the confines of the instruction-scheduler mechanism has "dried up", no further instructions can/will be scheduled until the stall is removed. Low-level details of the capabilities and limitations of the multiple-issue mechanism are unpublished AFAIK.

Slide 14 here may be of interest.



Source: https://stackoverflow.com/questions/35628624/cuda-thread-scheduling-latency-hiding
