How do I use local memory in OpenCL?

老子叫甜甜 提交于 2019-11-27 19:16:19

Check out the samples in the NVIDIA or AMD SDKs, they should point you in the right direction. Matrix transpose would use local memory for example.

Using your squaring kernel, you could stage the data in an intermediate buffer. Remember to pass in the additional parameter.

__kernel square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // if the threads were reading data from other threads, then we would
        // want a barrier here to ensure the write completes before the read
        output[gtid] =  temp[ltid] * temp[ltid];
    }
}

There is another possibility to do this, if the size of the local memory is constant. Without using a pointer in the kernels parameter list, the local buffer can be declared within the kernel just by declaring it __local:

__local float localBuffer[1024];

This removes code due to less clSetKernelArg calls.

In OpenCL local memory is meant to share data across all work items in a workgroup. And it usually requires to do a barrier call before the local memory data can be used (for example, one work item wants to read a local memory data that is written by the other work items). Barrier is costly in hardware. Keep in mind, local memory should be used for repeated data read/write. Bank conflict should be avoided as much as possible.

If you are not careful with local memory, you may end up with worse performance some time than using global memory.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!