How do instruction-level parallelism and thread-level parallelism work on GPUs?
Let's say I'm trying to do a simple reduction over an array of size n, kept within one work-group... say, adding all the elements. The general strategy seems to be to spawn a number of work items on the GPU, which reduce the elements in a tree. Naively this would seem to take log n steps, but it's not as if the first wave of threads all go in one shot, is it? They get scheduled in warps.

```c
for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
    if (local_index < offset) {
        // Body filled in to match the sum described above.
        scratch[local_index] += scratch[local_index + offset];
    }
    // Needed so each step sees the previous step's writes to local memory.
    barrier(CLK_LOCAL_MEM_FENCE);
}
```
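For context, here is a minimal sketch of the whole kernel I have in mind. The names `reduce_sum`, `in`, `out`, and `scratch` are my own placeholders, and it assumes the work-group size is a power of two and that n equals the global size:

```c
// Sketch of a tree reduction over one work-group's slice of the input.
__kernel void reduce_sum(__global const float* in,
                         __global float* out,
                         __local float* scratch) {
    int global_index = get_global_id(0);
    int local_index  = get_local_id(0);

    // Each work item loads one element into local memory.
    scratch[local_index] = in[global_index];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: halve the number of active work items each step,
    // so a group of size s finishes in log2(s) steps.
    for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
        if (local_index < offset) {
            scratch[local_index] += scratch[local_index + offset];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work item 0 writes this work-group's partial sum.
    if (local_index == 0) {
        out[get_group_id(0)] = scratch[0];
    }
}
```

Each work-group would produce one partial sum in `out`, so a second pass (or a host-side add) would still be needed to combine those into the final total.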