opencl

OpenCL C/C++ dynamic binding library (Win32 and more)

[亡魂溺海] submitted on 2019-12-06 04:23:36
Question: I'm giving OpenCL a try, and in order to put it into production I'd like to be able to bind dynamically to OpenCL.dll (when under Windows), so as to handle gracefully the case where no OpenCL is installed on the host computer. Is there any available library (or code snippet) that takes care of this dynamic binding in C or C++, much like GLEW does for OpenGL? I'd like to avoid the hassle of doing it myself. Thanks.

Answer 1: Here you go: http://clcc.sourceforge.net/clew_8h.html

Answer 2: Since you're dealing with Win32, the easiest solution is delay loading. If you delay-load OpenCL, and the compiler-added
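
Below is a minimal sketch (not the clew library itself) of what such manual binding can look like on Win32, using LoadLibrary/GetProcAddress. Only clGetPlatformIDs is bound here for brevity; a real wrapper would bind every entry point the same way.

    /* Minimal manual-binding sketch for Win32; illustrative only. */
    #include <windows.h>
    #include <CL/cl.h>
    #include <stdio.h>

    typedef cl_int (CL_API_CALL *PFN_clGetPlatformIDs)(
        cl_uint, cl_platform_id *, cl_uint *);

    static PFN_clGetPlatformIDs pclGetPlatformIDs = NULL;

    /* Returns 1 if OpenCL.dll was found and the entry point resolved. */
    static int load_opencl(void)
    {
        HMODULE lib = LoadLibraryA("OpenCL.dll");
        if (!lib)
            return 0; /* no OpenCL runtime installed: degrade gracefully */
        pclGetPlatformIDs =
            (PFN_clGetPlatformIDs)GetProcAddress(lib, "clGetPlatformIDs");
        return pclGetPlatformIDs != NULL;
    }

    int main(void)
    {
        if (!load_opencl()) {
            fprintf(stderr, "OpenCL not available, using fallback path\n");
            return 0;
        }
        cl_uint n = 0;
        pclGetPlatformIDs(0, NULL, &n);
        printf("%u OpenCL platform(s) found\n", n);
        return 0;
    }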

Disable Nvidia watchdog with OpenCL on Mac OS X 10.7.4

我怕爱的太早我们不能终老 submitted on 2019-12-06 04:15:58
Question: I have an OpenCL program which runs fine for small problems, but when running larger problems it exceeds the 8-10 s time limit for running kernels on Nvidia hardware. Although I have no monitors attached to the GPU I am computing on (an Nvidia GTX 580), the kernel is always terminated once it runs for around 8-10 s. The preliminary research I did on this problem indicates that the Nvidia watchdog should only enforce the time limit if a monitor is connected to the graphics card. However, I do not have any monitors connected to the GPU that OpenCL is running on, yet this limit is still enforced. Is it
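
One common workaround (an assumption on my part, not from the truncated text above) is to keep each individual launch short by slicing a large NDRange into several smaller enqueues via the global work offset, so no single kernel invocation runs long enough to trip the watchdog. The names TOTAL, SLICE, queue, and kernel below are placeholders:

    #include <CL/cl.h>

    #define TOTAL (1 << 24)  /* total work-items */
    #define SLICE (1 << 20)  /* work-items per launch; tune empirically */

    /* Enqueue the kernel in short slices instead of one long launch. */
    cl_int run_sliced(cl_command_queue queue, cl_kernel kernel)
    {
        cl_int err = CL_SUCCESS;
        for (size_t off = 0; off < TOTAL && err == CL_SUCCESS; off += SLICE) {
            size_t offset[1] = { off };
            size_t global[1] = { SLICE };
            err = clEnqueueNDRangeKernel(queue, kernel, 1, offset,
                                         global, NULL, 0, NULL, NULL);
            clFinish(queue); /* each slice completes well under the limit */
        }
        return err;
    }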

OpenCL code works on one machine but I am getting CL_INVALID_KERNEL_ARGS on another

北城以北 submitted on 2019-12-06 04:06:09
Question: I had the following code, which works well on one machine, but when I try to run it on another machine with a better graphics card I get errors:

    global[0] = 512; global[1] = 512;
    local [0] = 16;  local [1] = 16;
    ciErrNum = clEnqueueNDRangeKernel(commandQueue, myKernel, 2, NULL,
                                      global, local, 0, NULL, &event);

Errors:

    Error @ clEnqueueNDRangeKernel: CL_INVALID_KERNEL_ARGS
    Error @ clWaitForEvents: CL_INVALID_KERNEL_ARGS

Any idea what the problem is?

Answer 1: How large are the buffer objects you
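
CL_INVALID_KERNEL_ARGS typically means some kernel argument was never set (or no longer matches the kernel built for the new device). A hedged sketch of the usual fix, with illustrative argument names not taken from the question, is to set every argument with clSetKernelArg before the enqueue:

    #include <CL/cl.h>

    /* Illustrative: all arguments must be set before the enqueue, and
       the buffers must belong to the same context as the queue. */
    cl_int launch(cl_command_queue queue, cl_kernel kernel,
                  cl_mem input, cl_mem output, cl_int width,
                  cl_event *event)
    {
        cl_int err = CL_SUCCESS;
        err |= clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
        err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
        err |= clSetKernelArg(kernel, 2, sizeof(cl_int), &width);
        if (err != CL_SUCCESS)
            return err; /* a bad index or size shows up here, not later */

        size_t global[2] = { 512, 512 };
        size_t local [2] = { 16, 16 };
        return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                      global, local, 0, NULL, event);
    }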

OpenCL speed and floating-point precision

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-06 04:05:21
Question: I have just started working with OpenCL. However, I have found some weird behavior of OpenCL which I can't understand. The source I built and tested was http://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism . I have an ATI Radeon HD 4770 and an AMD FX-6200 3.8 GHz 6-core CPU.

Speed: Firstly, the speed does not scale linearly with the maximum number of work-group items. I ran APP Profiler to analyze the time spent during kernel execution. The result was a bit shocking: my GPU, which can only handle 256 work items per group, used 2.23008 milliseconds to calculate the square of
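
For timing questions like this, kernel duration can also be measured directly with OpenCL event profiling rather than an external profiler. A minimal sketch, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Time one 1-D kernel launch; the queue must have been created
       with the CL_QUEUE_PROFILING_ENABLE property. */
    void print_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                         size_t global, size_t local)
    {
        cl_event ev;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global, &local, 0, NULL, &ev);
        clWaitForEvents(1, &ev);

        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        printf("kernel took %.5f ms\n", (end - start) * 1e-6);
        clReleaseEvent(ev);
    }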

Optimal Local/Global worksizes in OpenCL

̄綄美尐妖づ submitted on 2019-12-06 03:55:43
Question: I am wondering how to choose optimal local and global work sizes for different devices in OpenCL. Is there any universal rule for AMD, NVIDIA, and Intel GPUs? Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)? Does it depend on the algorithm/implementation? I ask because I have seen that some libraries (like ViennaCL) assess the correct values simply by testing many combinations of local/global work sizes and choosing the best one.

Answer 1: NVIDIA recommends that your (local) workgroup size be a multiple of 32 (equal to one warp, which is their atomic
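
Rather than hard-coding such multiples per vendor, the runtime can be asked per kernel and per device. A minimal sketch of the relevant queries (the preferred multiple typically comes back as 32 on NVIDIA GPUs and 64 on many AMD GPUs):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Query per-kernel work-group hints instead of guessing. */
    void print_wg_hints(cl_kernel kernel, cl_device_id device)
    {
        size_t max_wg = 0, preferred = 0;
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(max_wg), &max_wg, NULL);
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred), &preferred, NULL);
        printf("max work-group size %zu, preferred multiple %zu\n",
               max_wg, preferred);
    }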

Sum Vector Components in OpenCL (SSE-like)

余生颓废 submitted on 2019-12-06 03:07:16
Question: Is there a single instruction to calculate the sum of all components of a float4, e.g., in OpenCL?

    float4 v;
    float desiredResult = v.x + v.y + v.z + v.w;

Answer 1:

    float4 v;
    float desiredResult = dot(v, (float4)(1.0f, 1.0f, 1.0f, 1.0f));

It's a little more work, because you're multiplying each component by one before adding them, but some GPUs have a dot-product instruction built in. So it might be faster, or it might be slower; it depends on your hardware.

Source: https://stackoverflow.com/questions/10811413/sum-vector-components-in-opencl-sse-like

Killing OpenCL Kernels

我只是一个虾纸丫 submitted on 2019-12-06 02:39:13
Question: Is there any way to kill a running OpenCL kernel through the OpenCL API? I haven't found anything in the spec. The only solutions I could come up with are (1) periodically checking a flag in the kernel that the host writes to when it wants the kernel to stop, or (2) running the kernel in a separate process and killing the entire process. I don't think either of those is a very elegant solution, and I'm not sure #1 would even work reliably.

Answer 1 (Eric Bainville): No, the OpenCL API doesn't allow a running kernel to be interrupted. On some systems, a kernel running for more than a few seconds will be killed by
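
For reference, here is a sketch of workaround #1 from the question (an illustrative kernel, not code from either post): the host sets a flag in a buffer and the kernel polls it. Note the spec does not guarantee that a host write becomes visible while the kernel is still running, which is exactly why the asker's reliability doubt is justified.

    // OpenCL C sketch of the "poll an abort flag" workaround (#1).
    __kernel void long_running(__global const int *abort_flag,
                               __global float *data,
                               int iterations)
    {
        int gid = get_global_id(0);
        for (int i = 0; i < iterations; ++i) {
            if (*abort_flag)   // cheap periodic check, set by the host
                return;        // give up cooperatively
            data[gid] = data[gid] * 0.999f + 1.0f; // placeholder work
        }
    }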

float vs. floatN

[亡魂溺海] submitted on 2019-12-06 01:40:35
Question: Is there any advantage to using floatN instead of float in OpenCL? For example:

    float3 position;

versus

    float posX, posY, posZ;

Thank you.

Answer 1: It depends on the hardware. NVidia GPUs have a scalar architecture, so vectors provide little advantage on them over writing purely scalar code. Quoting the NVidia OpenCL best practices guide (PDF link): "The CUDA architecture is a scalar architecture. Therefore, there is no performance benefit from using vector types and instructions. These should only be used for convenience." It is also in general better to have more work-items than fewer work-items using large vectors.
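
As an illustration of the trade-off (hypothetical kernels, not from the answer), the same position update can be written both ways. On a scalar architecture they compile to much the same machine code; float4 mainly buys wider loads/stores and tidier source:

    // Vectorized: one float4 per work-item.
    __kernel void move_vec(__global float4 *pos,
                           __global const float4 *vel)
    {
        int i = get_global_id(0);
        pos[i] += vel[i];
    }

    // Scalar: three separate component arrays.
    __kernel void move_scalar(__global float *px, __global float *py,
                              __global float *pz,
                              __global const float *vx,
                              __global const float *vy,
                              __global const float *vz)
    {
        int i = get_global_id(0);
        px[i] += vx[i];
        py[i] += vy[i];
        pz[i] += vz[i];
    }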

Selecting number of CPU cores in OpenCL

泪湿孤枕 submitted on 2019-12-06 01:35:20
Question: I am comparing the performance of OpenMP with that of OpenCL on CPUs, and my system has 8 cores, but I need comparisons for 2, 4, 6, and 8 cores respectively. I can set the number of cores in OpenMP through the omp_set_num_threads(n) function or an environment variable, but I don't know how I could do the same in OpenCL. Is there an alternative to OpenMP's omp_set_num_threads API in OpenCL?

Answer 1: There is no standard way to do this. OpenCL will try to use all of the resources available on an OpenCL device. One possibility you could look into is the device fission extension. It allows you to divide the device (in this
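
A sketch of the fission approach as it looks in core OpenCL 1.2 (clCreateSubDevices); on OpenCL 1.1 the cl_ext_device_fission extension provides the same idea with EXT-suffixed names. This is roughly the OpenCL analog of omp_set_num_threads(n):

    #include <CL/cl.h>

    /* Carve a sub-device with `cores` compute units out of a CPU
       device; build the context and queue on the sub-device. */
    cl_device_id make_subdevice(cl_device_id cpu, cl_uint cores)
    {
        cl_device_partition_property props[] = {
            CL_DEVICE_PARTITION_EQUALLY,
            (cl_device_partition_property)cores,
            0
        };
        cl_device_id sub = NULL;
        cl_uint count = 0;
        if (clCreateSubDevices(cpu, props, 1, &sub, &count) != CL_SUCCESS)
            return NULL; /* partitioning unsupported on this device */
        return sub;
    }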

When to use cudaHostRegister() and cudaHostAlloc()? What is the meaning of "pinned" or "page-locked" memory? What are the equivalents in OpenCL?

我怕爱的太早我们不能终老 submitted on 2019-12-06 01:31:35
Question: I am new to these Nvidia APIs, and some expressions are not so clear to me. I was wondering if somebody could help me understand when and how to use these CUDA commands in a simple way. To be more precise: studying how it is possible to speed up some applications with parallel execution of a kernel (with CUDA, for example), at some point I was facing the problem of speeding up host-device interaction. I have some information, gathered by surfing the web, but I am a little bit
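
On the OpenCL side, the usual analog of cudaHostAlloc() (a hedged sketch; whether the memory is truly page-locked is implementation-defined) is a buffer created with CL_MEM_ALLOC_HOST_PTR that is then mapped to obtain a host pointer:

    #include <CL/cl.h>

    /* Let the runtime allocate host-accessible (often pinned) memory,
       then map it. Unmap with clEnqueueUnmapMemObject when done. */
    void *alloc_mapped(cl_context ctx, cl_command_queue queue,
                       size_t bytes, cl_mem *buf_out)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx,
                                    CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    bytes, NULL, &err);
        if (err != CL_SUCCESS)
            return NULL;
        void *host = clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                        CL_MAP_READ | CL_MAP_WRITE,
                                        0, bytes, 0, NULL, NULL, &err);
        if (err != CL_SUCCESS) {
            clReleaseMemObject(buf);
            return NULL;
        }
        *buf_out = buf;
        return host;
    }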