cub

Why is my inclusive scan code 2x faster on a CPU than on a GPU?

人走茶凉 submitted on 2019-11-27 08:46:00
Question: I wrote a short CUDA program that uses the highly optimized CUB library to demonstrate that one core of an old quad-core Intel Q6600 processor (all four cores together are supposedly capable of ~30 GFLOPS) can do an inclusive scan (or cumulative/prefix sum, if you prefer) on 100,000 elements faster than an Nvidia 750 Ti (supposedly capable of 1306 GFLOPS in single precision). Why is this the case? The source code is:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cub/cub
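
For readers unfamiliar with the operation being timed, an inclusive scan produces, at each position, the sum of all elements up to and including that position. The following is only a minimal host-side sketch of that definition, not the poster's code (which is cut off above); the helper name `inclusive_scan_cpu` is hypothetical and the use of `std::partial_sum` is an assumption about how a single-core CPU baseline might be written.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper, not taken from the original post.
// Sequential inclusive scan: out[i] = in[0] + in[1] + ... + in[i].
std::vector<int> inclusive_scan_cpu(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    // std::partial_sum makes a single left-to-right pass, carrying a running sum.
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}
```

For example, the input {1, 2, 3, 4} would yield {1, 3, 6, 10}. A sequential version like this performs one addition per element, so it is a reasonable reference for checking the output of a GPU scan.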