How to verify wavefront/warp size in OpenCL?

Submitted by 。_饼干妹妹 on 2020-01-03 18:59:13

Question


I am using an AMD Radeon HD 7700 GPU. I want to use the following kernel to verify that the wavefront size is 64.

/* T is the element type; assumed to be supplied at build time,
   e.g. via clBuildProgram options "-DT=uint" */
__kernel
void kernel__test_warpsize(
        __global T* dataSet,
        uint size
        )
{
    size_t idx = get_global_id(0);

    T value = dataSet[idx];
    if (idx < size - 1)
        dataSet[idx + 1] = value;
}

In the main program, I pass an array of 128 elements with initial values dataSet[i] = i. After the kernel runs, I expect the following values:

dataSet[0]=0, dataSet[1]=0, dataSet[2]=1, ..., dataSet[63]=62, dataSet[64]=63, dataSet[65]=63, dataSet[66]=65, ..., dataSet[127]=126

However, I found that dataSet[65] is 64, not 63, which does not match my expectation.

My understanding is that the first wavefront (64 work-items) should change dataSet[64] to 63, so when the second wavefront executes, work-item #64 should read 63 and write it to dataSet[65]. But I see that dataSet[65] is still 64. Why?


Answer 1:


You are invoking undefined behaviour. If you wish to read memory that another work-item in the same workgroup is writing, you must use barriers.

In addition, assume that the GPU runs two wavefronts concurrently. Then dataSet[65] indeed contains a correct value: the first wavefront had simply not completed its write to dataSet[64] before the second wavefront read it.

An output in which every item is 0 is also a valid result according to the spec, because everything could equally well be performed completely serially in index order, with each work-item copying the 0 forward. That's why you need the barriers.
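To illustrate the barrier point, here is a sketch of a defined-behaviour variant of the kernel. It assumes the whole 128-element array is processed by a single work-group (global size = local size = 128), since barrier() only synchronizes work-items within one work-group; the kernel name is made up for this sketch.

    /* Sketch: defined behaviour, assuming one work-group covers the array */
    __kernel
    void kernel__test_shift(
            __global T* dataSet,
            uint size)
    {
        size_t idx = get_global_id(0);

        T value = dataSet[idx];          /* every work-item reads first    */
        barrier(CLK_GLOBAL_MEM_FENCE);   /* all reads finish before ...    */
        if (idx < size - 1)
            dataSet[idx + 1] = value;    /* ... any work-item writes       */
    }

With the barrier, the result is deterministically dataSet[i+1] = i for all i, regardless of wavefront size, which is exactly why this kernel can no longer "see" the wavefront width.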

Based on your comments, I have added this part:

Install CodeXL: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/

Read the AMD Accelerated Parallel Processing OpenCL Programming Guide: http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

Optimizing branching within a certain number of threads is only a small part of optimization. You should read up on how AMD hardware schedules the wavefronts within a workgroup, and how it hides memory latency by interleaving the execution of wavefronts (within a workgroup).

Branching also affects the execution of the whole workgroup: the effective time to run it is basically the time to execute its single longest-running wavefront. The device cannot free local memory etc. until everything in the group has finished, so it cannot schedule another workgroup in its place. This also depends on your local memory and register usage. To see what actually happens, just grab CodeXL and do a GPU profiling run; that will show exactly what happens on the device.

And even this applies only to the hardware of the current generation. That is why the concept is not in the OpenCL specification itself: these properties change a lot and depend heavily on the hardware.

But if you really want to know what the AMD wavefront size is, the answer is pretty much always 64 (see http://devgurus.amd.com/thread/159153 for a reference to their OpenCL programming guide). It is 64 for all GCN devices, which make up their whole current lineup. Some older devices had 16 or 32, but right now everything is just 64 (for NVIDIA it's 32 in general).
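Rather than hard-coding 64, the runtime can be asked directly. A portable way (OpenCL 1.1+) is clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, which in practice reports the wavefront size on AMD and the warp size on NVIDIA. A sketch, assuming `kernel` and `device` were created earlier; error checking omitted:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Query the preferred work-group size multiple for a built kernel.
     * On AMD GCN this is the wavefront size (64); on NVIDIA, the warp
     * size (32). Cannot run without an OpenCL device present. */
    void print_simd_width(cl_kernel kernel, cl_device_id device)
    {
        size_t simd = 0;
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(simd), &simd, NULL);
        printf("preferred work-group size multiple: %zu\n", simd);
    }

Note the spec only promises this is a performance hint, not a guaranteed SIMD width, so treat the value as informative rather than contractual.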




Answer 2:


"CUDA model - what is warp size?" - I think this is a good answer which explains the warp briefly.

But I am a bit confused about what sharpneli said: "If you set it to 512 it will almost certainly fail; the spec doesn't require implementations to support arbitrary local sizes. In AMD HW the local size is exactly the wavefront size. Same applies to Nvidia. In general you don't really need to care how the implementation will handle it."

I think the local size, which means the work-group size, is set by the programmer. But when execution happens, each work-group is subdivided by the hardware into units like warps.



Source: https://stackoverflow.com/questions/19871520/how-to-verify-wavefront-warp-size-in-opencl
