opencl

How do instruction level parallelism and thread level parallelism work on GPUs?

帅比萌擦擦* submitted 2019-11-30 18:09:22
Question: Let's say I'm trying to do a simple reduction over an array of size n, say kept within one work unit... say, adding all the elements. The general strategy seems to be to spawn a number of work items on each GPU, which reduce items in a tree. Naively this would seem to take log n steps, but it's not as if the first wave of threads all go in one shot, is it? They get scheduled in warps. for(int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) { if (local_index < offset) {
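For concreteness, here is one complete form of the tree reduction that loop comes from: a minimal sketch, assuming the global size is a multiple of the work-group size; the names in, partial_sums and scratch are illustrative. Each loop iteration is a separate scheduling round: warps whose work items fail the lid < offset test go idle, but every work item still has to reach the barrier.

    __kernel void reduce_sum(__global const float* in,
                             __global float* partial_sums,
                             __local float* scratch)
    {
        int lid = get_local_id(0);

        // Each work item stages one element in local memory.
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction: the number of active work items halves each step,
        // so the loop runs log2(local_size) times.
        for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
            if (lid < offset)
                scratch[lid] += scratch[lid + offset];
            barrier(CLK_LOCAL_MEM_FENCE);  // idle warps still synchronize here
        }

        // Work item 0 writes this work group's partial sum.
        if (lid == 0)
            partial_sums[get_group_id(0)] = scratch[0];
    }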

[Noting an issue] Android OpenCL C++: don't use the assignment functions of the Context and CommandQueue classes

孤者浪人 submitted 2019-11-30 16:56:40
At first the code was written like this: cl::Context ctx = cl::Context(CL_DEVICE_TYPE_GPU, NULL); cl::CommandQueue queue = cl::CommandQueue(ctx, devices[device_index], CL_QUEUE_PROFILING_ENABLE, &err); It ran without errors, but never produced correct results. Swapping objects one at a time against a known-good example eventually showed that the assignment functions of these two objects caused the problem. So unless you have read the source and confirmed it is safe, avoid copying these objects by value or using their assignment functions. Most likely the new object does receive all of the contents, but when the old object is destructed it invalidates what the new object points to. After changing to the following code it works: cl::Context* ctx = new cl::Context(CL_DEVICE_TYPE_GPU, NULL); cl::CommandQueue* queue = new cl::CommandQueue(*ctx, devices[device_index], CL_QUEUE_PROFILING_ENABLE, &err); Source: https://www.cnblogs.com/ahfuzhang/p/11605057.html
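For what it's worth, the heap allocation is not essential to the fix: constructing the objects directly also avoids the temporary-plus-assignment pattern the post blames, without any new. A sketch, with devices, device_index and err as in the post:

    cl::Context ctx(CL_DEVICE_TYPE_GPU, NULL);
    cl::CommandQueue queue(ctx, devices[device_index], CL_QUEUE_PROFILING_ENABLE, &err);
    // No temporary is created and later destructed, so nothing can pull the
    // underlying handles out from under the named objects.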

ERROR: pyopencl: creating context for specific device

人走茶凉 submitted 2019-11-30 16:20:05
Question: I want to create a context for a specific device on my platform, but I am getting an error. Code: import pyopencl as cl platform = cl.get_platforms() devices = platform[0].get_devices(cl.device_type.GPU) ctx = cl.Context(devices[0]) The error I am getting: Traceback (most recent call last): File "D:\Programming\Programs_OpenCL_Python\Matrix Multiplication\3\main3.py", line 16, in <module> ctx = cl.Context(devices[0]) AttributeError: 'Device' object has no attribute '__iter__' The program compiles
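The traceback is the clue: pyopencl's Context expects an iterable of devices, mirroring the underlying clCreateContext call, which takes an array of devices, so wrapping the device in a list, cl.Context([devices[0]]), is the usual fix. For comparison, the C++ wrapper has the same shape; a sketch (error handling omitted):

    #include <CL/cl.hpp>
    #include <vector>

    int main() {
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        std::vector<cl::Device> devices;
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        // The constructor takes a vector of devices, not a single device;
        // the same reason the Python binding wants a list.
        std::vector<cl::Device> chosen(1, devices[0]);
        cl::Context ctx(chosen);
        return 0;
    }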

Passing a struct containing an array of floats to the GPU with OpenCL

落花浮王杯 submitted 2019-11-30 16:16:57
I currently have some data that I would like to pass to my GPU and then multiply by 2. I have created a struct which can be seen here: struct GPUPatternData { cl_int nInput,nOutput,patternCount, offest; cl_float* patterns; }; This struct should contain an array of floats. The array of floats I will not know until run time as it is specified by the user. The host code: typedef struct GPUPatternDataContatiner { int nodeInput,nodeOutput,patternCount, offest; float* patterns; } GPUPatternData; __kernel void patternDataAddition(__global GPUPatternData* gpd,__global GPUPatternData* output) { int
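The usual diagnosis here: the cl_float* stored inside the struct is a host address, and a host address means nothing on the device, so the kernel cannot follow it. The common workaround is to pass the fixed-size fields and the float array as separate kernel arguments, with the array in its own buffer. A sketch (names are illustrative, not the poster's exact code):

    __kernel void patternDataAddition(const int patternCount,
                                      __global const float* patterns,
                                      __global float* output)
    {
        int i = get_global_id(0);
        // Guard in case the global size was rounded up.
        if (i < patternCount)
            output[i] = patterns[i] * 2.0f;
    }

On the host, patterns then gets its own clCreateBuffer/clEnqueueWriteBuffer pair instead of travelling inside the struct.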

Get optimum local/global workgroup size in OpenCL?

痴心易碎 submitted 2019-11-30 16:01:10
I am using the following function to get the best local and global work sizes for my OpenCL application. //maxWGSize == CL_KERNEL_WORK_GROUP_SIZE //wgMultiple == CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE //compUnits == CL_DEVICE_MAX_COMPUTE_UNITS //rems == max required work items void MyOpenCL::getBestWGSize(cl_uint maxWGSize, cl_uint wgMultiple, cl_uint compUnits, cl_uint rems, size_t *gsize, size_t *lsize) const { cl_uint cu = 1; if(wgMultiple <= rems) { bool flag = true; while(flag) { if(cu < compUnits) { cu++; if((wgMultiple * cu) > rems) { cu--; flag = false; break; } } else if(wgMultiple
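For reference, the values named in those comments come straight from OpenCL queries; a minimal C++ sketch of fetching them, plus one common heuristic (the kernel, device and rems objects are assumed to exist):

    size_t maxWGSize  = kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device);
    size_t wgMultiple = kernel.getWorkGroupInfo<CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE>(device);
    cl_uint compUnits = device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>();

    // Common heuristic: local size = the preferred multiple (capped by the
    // kernel's maximum), global size = rems rounded up to a multiple of it.
    size_t lsize = (wgMultiple < maxWGSize) ? wgMultiple : maxWGSize;
    size_t gsize = ((rems + lsize - 1) / lsize) * lsize;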

Is there a limit to OpenCL local memory?

断了今生、忘了曾经 submitted 2019-11-30 15:06:04
Question: Today I added four more __local variables to my kernel to dump intermediate results in. But just adding the four variables to the kernel's signature and adding the corresponding kernel arguments renders all output of the kernel as "0"s. None of the cl functions returns an error code. I further tried adding only one of the two smaller variables. If I add only one of them, it works, but if I add both of them, it breaks down. So could this behavior of OpenCL mean that I allocated too much _
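There is a hard per-work-group budget, CL_DEVICE_LOCAL_MEM_SIZE (commonly on the order of 16 to 48 KB on GPUs), and a kernel whose __local usage exceeds it is expected to fail at enqueue time with CL_OUT_OF_RESOURCES, although, as the question notes, drivers do not always report this cleanly. A quick check, as a sketch with device and kernel objects assumed:

    cl_ulong deviceLocal = device.getInfo<CL_DEVICE_LOCAL_MEM_SIZE>();
    cl_ulong kernelLocal = kernel.getWorkGroupInfo<CL_KERNEL_LOCAL_MEM_SIZE>(device);
    // kernelLocal covers __local variables declared inside the kernel;
    // __local buffers passed via clSetKernelArg count against the same budget.
    if (kernelLocal > deviceLocal) {
        // Too much local memory: enqueues of this kernel will not work.
    }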

Using #include to load OpenCL code

前提是你 submitted 2019-11-30 14:38:08
I've seen this done long ago with hlsl/glsl shader code -- using an #include on the source code file that pastes the code into a char* so that no file IO happens at runtime. If I were to represent it as pseudo-code, it would look a little like this: #define CLSourceToString(filename) " #include "filename" " const char* kernel = CLSourceToString("kernel.cl"); Now of course that #define isn't going to work because it'll just try to use those quotation marks to start strings. Answer: See the Bullet physics engine's use of OpenCL for how to do this to a kernel. In C++ / C source: #define MSTRINGIFY(A) #A
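Spelled out, the Bullet-style trick looks roughly like this (file and kernel names are illustrative): the whole .cl file is wrapped in the macro, and the #include pastes it into a string initializer, so no file IO happens at runtime.

    // kernel.cl -- the entire file is one macro invocation
    MSTRINGIFY(
    __kernel void vecAdd(__global const float* a,
                         __global const float* b,
                         __global float* c)
    {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];
    }
    );

    // host source
    #define MSTRINGIFY(A) #A
    const char* kernelSource =
    #include "kernel.cl"
    // kernelSource now holds the kernel text as a single string.

Caveat: stringification flattens line breaks and breaks on commas that sit outside any parentheses in the kernel (the parameter list is safe); a variadic form, #define MSTRINGIFY(...) #__VA_ARGS__, is more forgiving.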

Using OpenCL accelerated functions with OpenCV3 in Python

人盡茶涼 submitted 2019-11-30 14:26:05
OpenCV3 introduced its T-API (Transparent API), which lets the user call functions that are accelerated on the GPU (or another OpenCL-enabled device), but I'm struggling to find out how to tap into that with Python. With C++ there are calls like ocl::setUseOpenCL(true); that enable OpenCL acceleration when you use UMat instead of Mat objects. However, I found no documentation whatsoever for Python. Does anybody have any sample code, links or guides on how to achieve OpenCL acceleration with OpenCV3 in Python? UPDATE: After some further digging I've found this in modules/core/include/opencv2
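For orientation, the C++ path the question cites looks like the sketch below; later 3.x Python builds expose the same pieces as cv2.ocl.setUseOpenCL(True) and cv2.UMat (exact version availability is an assumption worth checking against your build). The file name is illustrative:

    #include <opencv2/opencv.hpp>
    #include <opencv2/core/ocl.hpp>

    int main() {
        cv::ocl::setUseOpenCL(true);               // opt in to OpenCL dispatch
        cv::UMat src, blurred;
        cv::imread("input.png").copyTo(src);       // Mat -> UMat upload
        cv::GaussianBlur(src, blurred, cv::Size(5, 5), 1.5);  // may run on the GPU
        cv::Mat result = blurred.getMat(cv::ACCESS_READ);     // map back to host
        return 0;
    }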