shuffle intrinsics with non-default mask providing data from inactive threads to active threads
Question

I'm using CUDA 9 on a Pascal architecture, trying to implement a reasonable block reduction using warp shuffle intrinsics plus a shared memory intermediate step. Examples I've seen on the web:

Using CUDA Warp Level Primitives
Faster Parallel Reductions -- Kepler

The first of those links illustrates the shuffle intrinsics with the _sync suffix, and how to use __ballot_sync(), but only goes as far as a single warp reduction. The second of those links is a Kepler-era article that doesn't use the newer