Why bother to know about CUDA Warps?

前端 未结 2 1596
慢半拍i
慢半拍i 2020-12-23 10:49

I have GeForce GTX460 SE, so it is: 6 SM x 48 CUDA Cores = 288 CUDA Cores. It is known that in one Warp contains 32 threads, and that in one block simultaneously (at a time)

2条回答
  •  情歌与酒
    2020-12-23 11:11

    Are threads grouped into Warps necessarily in order, 1 - 32, 33 - 64 ...?

    Yes, the programming model guarantees that the threads are grouped into warps in that specific order.

    As a simple example of optimizing of the divergent code paths can be used the separation of all the threads in the block in groups of 32 threads? For example: switch (threadIdx.s/32) { case 0: /* 1 warp*/ break; case 1: /* 2 warp*/ break; /* Etc */ }

    Exactly :)

    How many bytes must be read at one time for single Warp: 4 bytes * 32 Threads, 8 bytes * 32 Threads or 16 bytes * 32 Threads? As far as I know, the one transaction to the global memory at one time receives 128 bytes.

    Yes, transactions to global memory are 128 bytes. So, if each thread reads a 32-bit word from consecutive addresses (they probably need to be 128-byte aligned as well), all the threads in the warp can be serviced with a single transaction (4 bytes * 32 threads = 128 bytes). If each thread reads more bytes, or if the the addresses are not consecutive, more transactions need to be issued (with separate transactions for each separate 128-byte line that is touched).

    This is described in the CUDA Programming Manual 4.2, section F.4.2, "Global Memory". There's also a blurb in there saying that the situation is different with data that is cached only in L2, as the L2 cache has 32-byte cache lines. I don't know how to arrange for data to be cached only in L2 or how many transactions one ends up with.

提交回复
热议问题