opencl

What is the difference between creating a buffer object with clCreateBuffer + CL_MEM_COPY_HOST_PTR vs. clCreateBuffer + clEnqueueWriteBuffer?

Submitted by 随声附和 on 2019-11-30 07:54:59
Question: I have seen both versions in tutorials, but I could not find out what their advantages and disadvantages are. Which one is the proper one?

cl_mem input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * DATA_SIZE, NULL, NULL);
clEnqueueWriteBuffer(command_queue, input, CL_TRUE, 0, sizeof(float) * DATA_SIZE, inputdata, 0, NULL, NULL);

vs.

cl_mem input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float) * DATA_SIZE, inputdata, NULL);

Thanks. [Update] I added …
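Both variants leave the device buffer holding the same data; the difference is when and how the copy happens. CL_MEM_COPY_HOST_PTR copies from the host pointer at buffer-creation time, outside any command queue, while clEnqueueWriteBuffer performs the copy as a queued command that can be made blocking or non-blocking. A minimal side-by-side sketch, assuming a valid context, queue, and populated inputdata array (all hypothetical names), and an installed OpenCL runtime; error handling is elided:

```cpp
#include <CL/cl.h>

// Sketch only: assumes a valid context, queue and a populated
// float array of DATA_SIZE elements; not a complete program.
void upload_two_ways(cl_context context, cl_command_queue queue,
                     const float* inputdata, size_t DATA_SIZE) {
    cl_int err;

    // Variant 1: allocate first, then copy through the command queue.
    // The write is an enqueued command; CL_TRUE makes it blocking, so
    // inputdata may be reused as soon as the call returns.
    cl_mem a = clCreateBuffer(context, CL_MEM_READ_ONLY,
                              sizeof(float) * DATA_SIZE, NULL, &err);
    clEnqueueWriteBuffer(queue, a, CL_TRUE, 0, sizeof(float) * DATA_SIZE,
                         inputdata, 0, NULL, NULL);

    // Variant 2: copy at creation time; no queue is involved, and
    // CL_MEM_COPY_HOST_PTR requires a non-NULL host pointer.
    cl_mem b = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(float) * DATA_SIZE, (void*)inputdata, &err);

    clReleaseMemObject(a);
    clReleaseMemObject(b);
}
```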

Does AMD's OpenCL offer something similar to CUDA's GPUDirect?

Submitted by 点点圈 on 2019-11-30 07:06:20
NVIDIA offers GPUDirect to reduce memory-transfer overheads. I'm wondering if there is a similar concept for AMD/ATI. Specifically: 1) Do AMD GPUs avoid the second memory transfer when interfacing with network cards, as described here? In case the graphic is lost at some point, here is a description of the impact of GPUDirect on getting data from a GPU on one machine transferred across a network interface: with GPUDirect, GPU memory goes to host memory and then straight to the network interface card; without GPUDirect, GPU memory goes to host memory in one address space, then the CPU has to …

How do I know if the kernels are executing concurrently?

Submitted by 橙三吉。 on 2019-11-30 07:04:04
Question: I have a GPU with compute capability 3.0, so it should support 16 concurrent kernels. I am starting 10 kernels by calling clEnqueueNDRangeKernel in a loop 10 times. How do I know whether the kernels are executing concurrently? One way I have thought of is to record the time before and after the NDRangeKernel statement; I might have to use events to ensure that kernel execution has completed. But I still suspect that the loop starts the kernels sequentially. Can someone help me out? Answer 1: …
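The usual way to answer this is event profiling: attach an event to each enqueue, read CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END after clFinish, and check whether the [start, end) intervals overlap. Note that an in-order queue serializes kernels, so overlap also requires an out-of-order queue (or multiple queues). A sketch under those assumptions, with a valid context/device/kernel supplied by the caller and error handling elided; it needs an OpenCL runtime to actually run:

```cpp
#include <CL/cl.h>
#include <cstdio>

// Sketch only: assumes kernel args are already set. Overlapping
// [start, end) intervals across events imply concurrent execution.
void time_kernels(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                  size_t global_size) {
    // Profiling must be requested when the queue is created; the
    // out-of-order flag allows the runtime to overlap the kernels.
    cl_command_queue_properties props =
        CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
    cl_command_queue q = clCreateCommandQueue(ctx, dev, props, NULL);

    cl_event ev[10];
    for (int i = 0; i < 10; ++i)
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global_size,
                               NULL, 0, NULL, &ev[i]);
    clFinish(q);

    for (int i = 0; i < 10; ++i) {
        cl_ulong start, end;
        clGetEventProfilingInfo(ev[i], CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev[i], CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        std::printf("kernel %d: %llu -> %llu ns\n", i,
                    (unsigned long long)start, (unsigned long long)end);
        clReleaseEvent(ev[i]);
    }
    clReleaseCommandQueue(q);
}
```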

Advice for real time image processing

Submitted by ≯℡__Kan透↙ on 2019-11-30 05:36:24
Question: I really need some help and advice, as I'm new to real-time image processing. I am trying to implement an algorithm for a system in which the camera captures 1000 fps. I need to read the value of every pixel in each image and compute on the evolution of pixel[i][j] over N images, for all pixels. I have the data as an (unsigned char *ptr); I want to transfer it to the GPU, run the algorithm using CUDA, and return the results to the CPU, but I am not …

Persistent threads in OpenCL and CUDA

Submitted by 痞子三分冷 on 2019-11-30 05:29:49
I have read some papers about "persistent threads" for GPGPU, but I don't really understand the concept. Can anyone give me an example, or show me how this programming style is used? What I have kept in mind after reading and googling "persistent threads": persistent threads are nothing more than a while loop that keeps a thread running and computing many batches of work. Is this correct? Thanks in advance. References: http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1089 http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf CUDA …

How to pass and access C++ vectors to OpenCL kernel?

Submitted by 落爺英雄遲暮 on 2019-11-30 05:06:31
I'm new to C, C++, and OpenCL and doing my best to learn them at the moment. Here's a pre-existing C++ function that I'm trying to figure out how to port to OpenCL using either the C or C++ bindings.

#include <vector>
using namespace std;

class Test {
private:
    double a;
    vector<double> b;
    vector<long> c;
    vector<vector<double> > d;
public:
    double foo(long x, double y) {
        // mathematical operations
        // using x, y, a, b, c, d
        // and also b.size()
        // to calculate the return value
        return 0.0;
    }
};

Broadly, my question is how to pass all the class members that this function accesses into the binding and …

OpenCL user defined inline functions

Submitted by 天涯浪子 on 2019-11-30 04:13:31
Is it possible to define my own functions in OpenCL code so that the kernels can call them? If yes, where can I see a simple example? Kayhano The function used to create a program is:

cl_program clCreateProgramWithSource(cl_context context,
                                     cl_uint count,
                                     const char **strings,
                                     const size_t *lengths,
                                     cl_int *errcode_ret)

You can place functions inside the strings parameter like this:

float AddVector(float a, float b) {
    return a + b;
}

kernel void VectorAdd(global const float* a,
                      global const float* b,
                      global float* c) {
    int index = get_global_id(0);
    //c[index] = …

Better way to load vectors from memory. (clang)

Submitted by 僤鯓⒐⒋嵵緔 on 2019-11-30 04:03:50
Question: I'm writing a test program to get used to Clang's language extensions for OpenCL-style vectors. I can get the code to work, but I'm having trouble with one aspect of it: I can't seem to figure out how to get Clang to load a vector from a scalar array nicely. At the moment I have to do something like:

byte16 va = (byte16){ argv[1][start],     argv[1][start + 1],
                      argv[1][start + 2], argv[1][start + 3],
                      argv[1][start + 4], argv[1][start + 5],
                      argv[1][start + 6], argv[1][start + 7], …

OpenCL CPU Device vs GPU Device

Submitted by 删除回忆录丶 on 2019-11-30 02:10:15
Consider a simple example: vector addition. If I build a program for CL_DEVICE_TYPE_GPU and build the same program for CL_DEVICE_TYPE_CPU, what is the difference between them (other than the "CPU program" running on the CPU and the "GPU program" running on the GPU)? Thanks for your help. There are a few differences between the device types. The short answer for your vector question: use a GPU for large vectors and a CPU for smaller workloads. 1) Memory copying. GPUs rely on the data you are working on being passed to them, and the results are later read back to the host. This is done over PCIe, …

What is the context switching mechanism in GPU?

Submitted by 一世执手 on 2019-11-30 01:56:53
As I understand it, GPUs switch between warps to hide memory latency. But I wonder: under which conditions is a warp switched out? For example, if a warp performs a load and the data is already in the cache, is the warp switched out, or does it continue with the next computation? What happens if there are two consecutive adds? Thanks. First of all, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until they all exit the kernel. Thus a block is not launched until there are sufficient registers for all warps of the block, and until there is enough free shared memory …