What's the relationship between maximum work group size and warp size? Let's say my device has 240 CUDA streaming processors (SP) and returns the following information -
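A warp is the group of threads that a multiprocessor executes together in lockstep; on NVIDIA hardware the warp size is 32 threads. A work group (a thread block in CUDA terms) is split into warps, and the multiprocessor uses hardware multithreading to keep several warps in flight at the same time. The maximum work group size is only the upper limit on how many threads one work group may contain, so in practice you pick a work group size that is a multiple of the warp size and does not exceed that maximum.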
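To make the arithmetic concrete, here is a minimal sketch of how a work group breaks down into warps. The function names are made up for illustration, and the maximum work group size of 512 is an assumed example value, not something taken from the question; in real code you would query both numbers from the device.

```python
WARP_SIZE = 32  # warp size on NVIDIA GPUs; query the device in real code


def warps_per_group(group_size, warp_size=WARP_SIZE):
    """Number of warps needed to run one work group (rounded up)."""
    return (group_size + warp_size - 1) // warp_size


def idle_lanes(group_size, warp_size=WARP_SIZE):
    """Thread slots left unused in the last, partially filled warp."""
    return warps_per_group(group_size, warp_size) * warp_size - group_size


def largest_aligned_size(max_group_size, warp_size=WARP_SIZE):
    """Largest multiple of the warp size that fits within the device limit."""
    return (max_group_size // warp_size) * warp_size


if __name__ == "__main__":
    MAX_GROUP_SIZE = 512  # assumed example value, not from the question
    for size in (100, 128, MAX_GROUP_SIZE):
        print(size, warps_per_group(size), idle_lanes(size))
    print(largest_aligned_size(MAX_GROUP_SIZE))
```

A work group of 100 threads still occupies four warps, with 28 slots of the last warp doing nothing, which is why sizes that are multiples of 32 are usually preferred.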
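The warp size also matters for performance because global memory accesses issued by the threads of a warp (a half-warp on compute capability 1.x devices) are coalesced into as few memory transactions as possible, with transaction sizes of 32, 64, or 128 bytes. When neighbouring threads access neighbouring addresses, a whole warp can be served by a handful of transactions instead of one per thread.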
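A rough way to see the effect of the access pattern is to count how many aligned memory segments one warp touches. The sketch below is a deliberately simplified model (fixed 128-byte segments, 4-byte elements, hypothetical helper names); the real coalescing rules depend on the compute capability of the device, so treat the numbers as an illustration only.

```python
def warp_addresses(base, stride_elems, elem_bytes=4, warp_size=32):
    """Byte address accessed by each thread (lane) of one warp."""
    return [base + lane * stride_elems * elem_bytes for lane in range(warp_size)]


def segments_touched(addresses, segment_bytes=128):
    """Number of distinct aligned segments covered by the given addresses."""
    return len({addr // segment_bytes for addr in addresses})


if __name__ == "__main__":
    patterns = {
        "contiguous, aligned": warp_addresses(base=0, stride_elems=1),
        "contiguous, misaligned": warp_addresses(base=4, stride_elems=1),
        "stride of 2 elements": warp_addresses(base=0, stride_elems=2),
        "stride of 32 elements": warp_addresses(base=0, stride_elems=32),
    }
    for name, addrs in patterns.items():
        print(name, segments_touched(addrs))
```

In this model a contiguous, aligned access needs a single segment, while a stride of 32 elements makes every thread hit its own segment, which is the pattern coalescing cannot help with.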
The CUDA C Best Practices Guide covers this kind of optimization in detail.