I'm using OpenCL, and I need to memset() some array in global device memory. CUDA has a memset()-like API function, but OpenCL does not. I read this, where I found two possible alternatives:
- Using memset() on the host with some scratch buffer, then clEnqueueWriteBuffer() to copy that to the buffer on the device.
- Enqueueing the following kernel:
__kernel void memset_uint4(__global uint4* mem,__private uint4 val) { mem[get_global_id(0)]=val; }
Which is better? Or rather, under which circumstances/for which platforms is one better than the other?
Note: If the special case of zeroing memory merits special treatment, that would be nice to know too.
You can use clEnqueueFillBuffer(), available since OpenCL 1.2. It is exactly what you need, and it is very flexible about the pattern used to fill the buffer.
Here is the doc page:
http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueFillBuffer.html
If you are on 1.1 or below, you will have to resort to other approaches.
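For the OpenCL 1.2 route, a minimal sketch of zeroing a buffer with clEnqueueFillBuffer() might look like the following. The helper names `fill_args_valid` and `zero_buffer` and the `USE_OPENCL` build flag are assumptions made for illustration, not part of OpenCL. The check mirrors the documented constraints: the pattern size must be one of 1, 2, 4, ..., 128 bytes, and both offset and size must be multiples of it.

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_OPENCL        /* hypothetical flag: define when an OpenCL 1.2+ SDK is available */
#include <CL/cl.h>       /* on macOS the header is <OpenCL/opencl.h> */
#endif

/* Returns 1 if the arguments satisfy the clEnqueueFillBuffer constraints:
   pattern_size is a power of two in 1..128, and offset and size are
   multiples of pattern_size. */
int fill_args_valid(size_t offset, size_t size, size_t pattern_size)
{
    if (pattern_size == 0 || pattern_size > 128) return 0;
    if (pattern_size & (pattern_size - 1)) return 0;  /* not a power of two */
    return offset % pattern_size == 0 && size % pattern_size == 0;
}

#ifdef USE_OPENCL
/* Enqueue a fill that zeroes the first `size` bytes of `buf`.
   `size` is assumed to be a multiple of sizeof(cl_uint). */
cl_int zero_buffer(cl_command_queue queue, cl_mem buf, size_t size)
{
    const cl_uint zero = 0;
    assert(fill_args_valid(0, size, sizeof zero));
    return clEnqueueFillBuffer(queue, buf, &zero, sizeof zero,
                               0 /* offset */, size, 0, NULL, NULL);
}
#endif
```

The call is asynchronous like other enqueue commands, so the pattern is copied at enqueue time but the buffer is only guaranteed to be filled once the command has completed.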
A very fast way to do this (if you have extra memory available) is to keep a pre-sized, pre-initialized array (such as one filled with all zeros) and do an on-device copy any time you need to zero out the buffer. In my experience this was much faster than any of the fill calls in OpenCL or CUDA. Obviously this is a special case, and the comparison reflects what I saw when I last tested it.
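The device-side copy described above can be sketched with clEnqueueCopyBuffer(). The helper names `next_chunk` and `zero_by_copy` and the `USE_OPENCL` build flag are hypothetical. Copying in chunks is an added assumption so the sketch also works when the zero-filled source is smaller than the destination; with a source of the same size the loop runs once.

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_OPENCL        /* hypothetical flag: define when an OpenCL SDK is available */
#include <CL/cl.h>       /* on macOS the header is <OpenCL/opencl.h> */
#endif

/* Number of bytes to copy in the next step when `remaining` bytes are
   still to be cleared and the zero-filled source holds `zeros_size` bytes. */
size_t next_chunk(size_t remaining, size_t zeros_size)
{
    return remaining < zeros_size ? remaining : zeros_size;
}

#ifdef USE_OPENCL
/* Clear `dst` by enqueueing device-to-device copies from `zeros`,
   a buffer assumed to have been filled with zeros once at start-up. */
cl_int zero_by_copy(cl_command_queue queue, cl_mem zeros, size_t zeros_size,
                    cl_mem dst, size_t dst_size)
{
    assert(zeros_size > 0);
    for (size_t done = 0; done < dst_size; ) {
        size_t n = next_chunk(dst_size - done, zeros_size);
        cl_int err = clEnqueueCopyBuffer(queue, zeros, dst,
                                         0 /* src offset */, done /* dst offset */,
                                         n, 0, NULL, NULL);
        if (err != CL_SUCCESS) return err;
        done += n;
    }
    return CL_SUCCESS;
}
#endif
```

One way to create the source buffer is clCreateBuffer() with CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR and a calloc()'d host array. Whether this beats a fill call depends on the platform, so it is worth measuring both.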
Source: https://stackoverflow.com/questions/18100948/what-is-are-the-fastest-memset-alternatives-for-opencl