OpenCL buffer allocation and mapping best practice

Submitted by 萝らか妹 on 2020-01-16 07:41:13

Question


I am a little confused as to whether my code using OpenCL mapped buffers is correct.

I have two examples, one using CL_MEM_USE_HOST_PTR and one using CL_MEM_ALLOC_HOST_PTR. Both work and run on my local machine and OpenCL devices, but I am interested in whether this is the correct way of doing the mapping, and whether it should work on all OpenCL devices. I am especially unsure about the USE_HOST_PTR example.

I am only interested in the buffer/map specific operations. I am aware I should do error checking and so forth.

CL_MEM_ALLOC_HOST_PTR:

// pointer to hold the result
int * host_ptr = malloc(size * sizeof(int));
int i;

d_mem = clCreateBuffer(context,CL_MEM_READ_WRITE|CL_MEM_ALLOC_HOST_PTR,
                       size*sizeof(cl_int), NULL, &ret);

int * map_ptr = clEnqueueMapBuffer(command_queue,d_mem,CL_TRUE,CL_MAP_WRITE,
                                   0,size*sizeof(int),0,NULL,NULL,&ret);
// initialize data
for (i=0; i<size;i++) {
  map_ptr[i] = i;
}

ret = clEnqueueUnmapMemObject(command_queue,d_mem,map_ptr,0,NULL,NULL); 

//Set OpenCL Kernel Parameters
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&d_mem);

size_t global_work[1]  = { size };
//Execute OpenCL Kernel
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             global_work, NULL, 0, NULL, NULL);

map_ptr = clEnqueueMapBuffer(command_queue,d_mem,CL_TRUE,CL_MAP_READ,
                             0,size*sizeof(int),0,NULL,NULL,&ret);
// copy the data to result array 
for (i=0; i<size;i++){
  host_ptr[i] = map_ptr[i];
} 

ret = clEnqueueUnmapMemObject(command_queue,d_mem,map_ptr,0,NULL,NULL);        

// cl finish etc     

CL_MEM_USE_HOST_PTR:

// pointer to hold the result
int * host_ptr = malloc(size * sizeof(int));
int i;
for(i=0; i<size;i++) {
  host_ptr[i] = i;
}

d_mem = clCreateBuffer(context,CL_MEM_READ_WRITE|CL_MEM_USE_HOST_PTR,
                       size*sizeof(cl_int), host_ptr, &ret);

// No need to map or unmap here: since we use CL_MEM_USE_HOST_PTR, the
// buffer is initialized from host_ptr at creation time?

//Set OpenCL Kernel Parameters
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&d_mem);

size_t global_work[1]  = { size };
//Execute OpenCL Kernel
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             global_work, NULL, 0, NULL, NULL);

// This returns a pointer into host_ptr (I assume it always will?),
// but we do need to call the map function to ensure the data is
// copied back. There's no need to manually copy it into host_ptr,
// as the mapped region is host_ptr's own memory. We still save the
// returned pointer so we can pass it to the unmap call below.
int * map_ptr = clEnqueueMapBuffer(command_queue,d_mem,CL_TRUE,CL_MAP_READ,
                                   0,size*sizeof(int),0,NULL,NULL,&ret);

ret = clEnqueueUnmapMemObject(command_queue,d_mem,map_ptr,0,NULL,NULL);        

// cl finish, cleanup etc

Answer 1:


If you use CL_MEM_ALLOC_HOST_PTR, there is a chance that the underlying OpenCL implementation uses page-locked memory.

That means that the page cannot be swapped out to disk and that the transfer between host and device memory would be done DMA style without wasting CPU cycles. Therefore in this case CL_MEM_ALLOC_HOST_PTR would be the best solution.

NVIDIA has the page-locked (pinned) memory feature and should also use it in their OpenCL implementation. For AMD it's not certain whether they do the same. Check here for more details.

Using CL_MEM_USE_HOST_PTR mainly makes the programmer's life easier, so in the unlikely case that the hardware cannot use page-locked memory, you could fall back to this option.



Source: https://stackoverflow.com/questions/26277268/opencl-buffer-allocation-and-mapping-best-practice
