What Causes Instruction Replay Overhead in CUDA
Question: I ran the Visual Profiler on a CUDA application of mine. The application calls a single kernel multiple times if the data is too large. This kernel has no branching. The profiler reports a high instruction replay overhead of 83.6% and a high global memory instruction replay overhead of 83.5%. Here is how the kernel generally looks:

```cuda
// Decryption kernel
__global__ void dev_decrypt(uint8_t *in_blk, uint8_t *out_blk){
    __shared__ volatile word sdata[256];
    register uint32_t data;

    // Thread ID
```