opencl

OpenCL C/C++ dynamic binding library (Win32 and more)

[亡魂溺海] submitted on 2019-12-06 04:23:36
Question: I'm giving OpenCL a try, and in order to put it into production I'd like to be able to bind dynamically to OpenCL.dll (when under Windows), so as to handle gracefully the case where no OpenCL is installed on the host computer. Is there any available library (or code snippet) that takes care of this dynamic binding in C or C++, much like GLEW does for OpenGL? I'd like to avoid the hassle of doing it myself. Thanks.

Answer 1: Here you go: http://clcc.sourceforge.net/clew_8h.html

Answer 2: Since you're dealing with Win32, the easiest solution is delay loading. If you delay-load OpenCL, and the compiler-added
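
Below is a minimal sketch (not the clew library itself) of what such manual binding can look like on Win32, using LoadLibrary/GetProcAddress. Only clGetPlatformIDs is bound here for brevity; a real wrapper would bind every entry point the same way.

    /* Minimal manual-binding sketch for Win32; illustrative only. */
    #include <windows.h>
    #include <CL/cl.h>
    #include <stdio.h>

    typedef cl_int (CL_API_CALL *PFN_clGetPlatformIDs)(
        cl_uint, cl_platform_id *, cl_uint *);

    static PFN_clGetPlatformIDs pclGetPlatformIDs = NULL;

    /* Returns 1 if OpenCL.dll was found and the entry point resolved. */
    static int load_opencl(void)
    {
        HMODULE lib = LoadLibraryA("OpenCL.dll");
        if (!lib)
            return 0; /* no OpenCL runtime installed: degrade gracefully */
        pclGetPlatformIDs =
            (PFN_clGetPlatformIDs)GetProcAddress(lib, "clGetPlatformIDs");
        return pclGetPlatformIDs != NULL;
    }

    int main(void)
    {
        if (!load_opencl()) {
            fprintf(stderr, "OpenCL not available, using fallback path\n");
            return 0;
        }
        cl_uint n = 0;
        pclGetPlatformIDs(0, NULL, &n);
        printf("%u OpenCL platform(s) found\n", n);
        return 0;
    }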

Disable Nvidia watchdog with OpenCL on Mac OS X 10.7.4

我怕爱的太早我们不能终老 submitted on 2019-12-06 04:15:58
Question: I have an OpenCL program which runs fine for small problems, but when running larger problems it exceeds the 8-10 s time limit for running kernels on Nvidia hardware. Although I have no monitors attached to the GPU I am computing on (an Nvidia GTX 580), the kernel is always terminated once it runs for around 8-10 s. The preliminary research I did on this problem indicates that the Nvidia watchdog should only enforce the time limit if a monitor is connected to the graphics card. However, I do not have any monitors connected to the GPU that OpenCL is running on, yet this limit is still enforced. Is it
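
One common workaround (an assumption on my part, not from the truncated text above) is to keep each individual launch short by slicing a large NDRange into several smaller enqueues via the global work offset, so no single kernel invocation runs long enough to trip the watchdog. The names TOTAL, SLICE, queue, and kernel below are placeholders:

    #include <CL/cl.h>

    #define TOTAL (1 << 24)  /* total work-items */
    #define SLICE (1 << 20)  /* work-items per launch; tune empirically */

    /* Enqueue the kernel in short slices instead of one long launch. */
    cl_int run_sliced(cl_command_queue queue, cl_kernel kernel)
    {
        cl_int err = CL_SUCCESS;
        for (size_t off = 0; off < TOTAL && err == CL_SUCCESS; off += SLICE) {
            size_t offset[1] = { off };
            size_t global[1] = { SLICE };
            err = clEnqueueNDRangeKernel(queue, kernel, 1, offset,
                                         global, NULL, 0, NULL, NULL);
            clFinish(queue); /* each slice completes well under the limit */
        }
        return err;
    }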

OpenCL code works on one machine but I am getting CL_INVALID_KERNEL_ARGS on another

北城以北 submitted on 2019-12-06 04:06:09
Question: I had the following code, which works well on one machine, but when I try to run it on another machine with a better graphics card I get errors:

    global[0] = 512; global[1] = 512;
    local [0] = 16;  local [1] = 16;
    ciErrNum = clEnqueueNDRangeKernel(commandQueue, myKernel, 2, NULL,
                                      global, local, 0, NULL, &event);

Errors:

    Error @ clEnqueueNDRangeKernel: CL_INVALID_KERNEL_ARGS
    Error @ clWaitForEvents: CL_INVALID_KERNEL_ARGS

Any idea what the problem is?

Answer 1: How large are the buffer objects you
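
CL_INVALID_KERNEL_ARGS typically means some kernel argument was never set (or no longer matches the kernel built for the new device). A hedged sketch of the usual fix, with illustrative argument names not taken from the question, is to set every argument with clSetKernelArg before the enqueue:

    #include <CL/cl.h>

    /* Illustrative: all arguments must be set before the enqueue, and
       the buffers must belong to the same context as the queue. */
    cl_int launch(cl_command_queue queue, cl_kernel kernel,
                  cl_mem input, cl_mem output, cl_int width,
                  cl_event *event)
    {
        cl_int err = CL_SUCCESS;
        err |= clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
        err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
        err |= clSetKernelArg(kernel, 2, sizeof(cl_int), &width);
        if (err != CL_SUCCESS)
            return err; /* a bad index or size shows up here, not later */

        size_t global[2] = { 512, 512 };
        size_t local [2] = { 16, 16 };
        return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                      global, local, 0, NULL, event);
    }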

OpenCL speed and floating-point precision

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-06 04:05:21
Question: I have just started working with OpenCL. However, I have found some weird behavior of OpenCL which I can't understand. The source I built and tested was http://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism . I have an ATI Radeon HD 4770 and an AMD FX-6200 3.8 GHz 6-core CPU.

Speed: Firstly, the speed does not scale linearly with the maximum number of work-group items. I ran APP Profiler to analyze the time spent during kernel execution. The result was a bit shocking: my GPU, which can only handle 256 work items per group, used 2.23008 milliseconds to calculate the square of
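
For timing questions like this, kernel duration can also be measured directly with OpenCL event profiling rather than an external profiler. A minimal sketch, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Time one 1-D kernel launch; the queue must have been created
       with the CL_QUEUE_PROFILING_ENABLE property. */
    void print_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                         size_t global, size_t local)
    {
        cl_event ev;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global, &local, 0, NULL, &ev);
        clWaitForEvents(1, &ev);

        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        printf("kernel took %.5f ms\n", (end - start) * 1e-6);
        clReleaseEvent(ev);
    }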

Optimal Local/Global worksizes in OpenCL

̄綄美尐妖づ submitted on 2019-12-06 03:55:43
Question: I am wondering how to choose optimal local and global work sizes for different devices in OpenCL. Is there any universal rule for AMD, NVIDIA, and Intel GPUs? Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)? Does it depend on the algorithm/implementation? I ask because I have seen that some libraries (like ViennaCL) assess the correct values simply by testing many combinations of local/global work sizes and choosing the best one.

Answer 1: NVIDIA recommends that your (local) workgroup size be a multiple of 32 (equal to one warp, which is their atomic
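
Rather than hard-coding such multiples per vendor, the runtime can be asked per kernel and per device. A minimal sketch of the relevant queries (the preferred multiple typically comes back as 32 on NVIDIA GPUs and 64 on many AMD GPUs):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Query per-kernel work-group hints instead of guessing. */
    void print_wg_hints(cl_kernel kernel, cl_device_id device)
    {
        size_t max_wg = 0, preferred = 0;
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(max_wg), &max_wg, NULL);
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred), &preferred, NULL);
        printf("max work-group size %zu, preferred multiple %zu\n",
               max_wg, preferred);
    }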

Sum Vector Components in OpenCL (SSE-like)

余生颓废 submitted on 2019-12-06 03:07:16
Question: Is there a single instruction to calculate the sum of all components of a float4, e.g., in OpenCL?

    float4 v;
    float desiredResult = v.x + v.y + v.z + v.w;

Answer 1:

    float4 v;
    float desiredResult = dot(v, (float4)(1.0f, 1.0f, 1.0f, 1.0f));

It's a little more work, because you're multiplying each component by one before adding them, but some GPUs have a dot-product instruction built in. So it might be faster, or it might be slower; it depends on your hardware.

Source: https://stackoverflow.com/questions/10811413/sum-vector-components-in-opencl-sse-like

Killing OpenCL Kernels

我只是一个虾纸丫 submitted on 2019-12-06 02:39:13
Question: Is there any way to kill a running OpenCL kernel through the OpenCL API? I haven't found anything in the spec. The only solutions I could come up with are (1) periodically checking a flag in the kernel that the host writes to when it wants the kernel to stop, or (2) running the kernel in a separate process and killing the entire process. I don't think either of those is a very elegant solution, and I'm not sure #1 would even work reliably.

Answer 1 (Eric Bainville): No, the OpenCL API doesn't allow a running kernel to be interrupted. On some systems, a kernel running for more than a few seconds will be killed by
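
For reference, here is a sketch of workaround #1 from the question (an illustrative kernel, not code from either post): the host sets a flag in a buffer and the kernel polls it. Note the spec does not guarantee that a host write becomes visible while the kernel is still running, which is exactly why the asker's reliability doubt is justified.

    // OpenCL C sketch of the "poll an abort flag" workaround (#1).
    __kernel void long_running(__global const int *abort_flag,
                               __global float *data,
                               int iterations)
    {
        int gid = get_global_id(0);
        for (int i = 0; i < iterations; ++i) {
            if (*abort_flag)   // cheap periodic check, set by the host
                return;        // give up cooperatively
            data[gid] = data[gid] * 0.999f + 1.0f; // placeholder work
        }
    }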

float vs. floatN

[亡魂溺海] submitted on 2019-12-06 01:40:35
Question: Is there any advantage to using floatN instead of float in OpenCL? For example:

    float3 position;

versus

    float posX, posY, posZ;

Thank you.

Answer 1: It depends on the hardware. NVidia GPUs have a scalar architecture, so vectors provide little advantage on them over writing purely scalar code. Quoting the NVidia OpenCL best practices guide (PDF link): "The CUDA architecture is a scalar architecture. Therefore, there is no performance benefit from using vector types and instructions. These should only be used for convenience." It is also in general better to have more work-items than fewer work-items using large vectors.
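
As an illustration of the trade-off (hypothetical kernels, not from the answer), the same position update can be written both ways. On a scalar architecture they compile to much the same machine code; float4 mainly buys wider loads/stores and tidier source:

    // Vectorized: one float4 per work-item.
    __kernel void move_vec(__global float4 *pos,
                           __global const float4 *vel)
    {
        int i = get_global_id(0);
        pos[i] += vel[i];
    }

    // Scalar: three separate component arrays.
    __kernel void move_scalar(__global float *px, __global float *py,
                              __global float *pz,
                              __global const float *vx,
                              __global const float *vy,
                              __global const float *vz)
    {
        int i = get_global_id(0);
        px[i] += vx[i];
        py[i] += vy[i];
        pz[i] += vz[i];
    }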

Selecting number of CPU cores in OpenCL

泪湿孤枕 submitted on 2019-12-06 01:35:20
Question: I am comparing the performance of OpenMP with that of OpenCL on CPUs, and my system has 8 cores, but I need comparisons for 2, 4, 6, and 8 cores respectively. I can set the number of cores in OpenMP through the omp_set_num_threads(n) function or an environment variable, but I don't know how I could do the same in OpenCL. Is there an alternative to OpenMP's omp_set_num_threads API in OpenCL?

Answer 1: There is no standard way to do this. OpenCL will try to use all of the resources available on an OpenCL device. One possibility you could look into is the device fission extension. It allows you to divide the device (in this
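
A sketch of the fission approach as it looks in core OpenCL 1.2 (clCreateSubDevices); on OpenCL 1.1 the cl_ext_device_fission extension provides the same idea with EXT-suffixed names. This is roughly the OpenCL analog of omp_set_num_threads(n):

    #include <CL/cl.h>

    /* Carve a sub-device with `cores` compute units out of a CPU
       device; build the context and queue on the sub-device. */
    cl_device_id make_subdevice(cl_device_id cpu, cl_uint cores)
    {
        cl_device_partition_property props[] = {
            CL_DEVICE_PARTITION_EQUALLY,
            (cl_device_partition_property)cores,
            0
        };
        cl_device_id sub = NULL;
        cl_uint count = 0;
        if (clCreateSubDevices(cpu, props, 1, &sub, &count) != CL_SUCCESS)
            return NULL; /* partitioning unsupported on this device */
        return sub;
    }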

When to use cudaHostRegister() and cudaHostAlloc()? What is the meaning of "pinned" or "page-locked" memory? What are the equivalents in OpenCL?

我怕爱的太早我们不能终老 submitted on 2019-12-06 01:31:35
Question: I am new to these Nvidia APIs, and some expressions are not so clear to me. I was wondering if somebody could help me understand when and how to use these CUDA commands in a simple way. To be more precise: studying how it is possible to speed up some applications with parallel execution of a kernel (with CUDA, for example), at some point I was facing the problem of speeding up host-device interaction. I have some information, gathered by surfing the web, but I am a little bit
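
On the OpenCL side, the usual analog of cudaHostAlloc() (a hedged sketch; whether the memory is truly page-locked is implementation-defined) is a buffer created with CL_MEM_ALLOC_HOST_PTR that is then mapped to obtain a host pointer:

    #include <CL/cl.h>

    /* Let the runtime allocate host-accessible (often pinned) memory,
       then map it. Unmap with clEnqueueUnmapMemObject when done. */
    void *alloc_mapped(cl_context ctx, cl_command_queue queue,
                       size_t bytes, cl_mem *buf_out)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx,
                                    CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    bytes, NULL, &err);
        if (err != CL_SUCCESS)
            return NULL;
        void *host = clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                        CL_MAP_READ | CL_MAP_WRITE,
                                        0, bytes, 0, NULL, NULL, &err);
        if (err != CL_SUCCESS) {
            clReleaseMemObject(buf);
            return NULL;
        }
        *buf_out = buf;
        return host;
    }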