opencl

Tutorials or books on kernel programming for OpenCL? [closed]

ⅰ亾dé卋堺 submitted on 2019-12-10 19:28:59
Question (closed as off-topic 2 years ago; not accepting answers): The question is specific enough, I suppose. Just to make it clear: I am not looking for a reference, but a tutorial. I am interested specifically in the kernel-programming aspect. Answer 1: There aren't that many books out there, so you can't be very picky. There are two that are more like guides and less like a…

Can I use external OpenCl libraries?

☆樱花仙子☆ submitted on 2019-12-10 19:09:12
Question: I want to use some external libraries (http://trac.osgeo.org/geos/) to perform some analytical tasks on Geometry objects (GIS). I want to perform these tasks using OpenCL on CUDA so that I can use the parallel power of the GPU to run them on large sets of data. So my question is: can I write a kernel using these libraries? Also, how can I pass the objects of these libraries' complex data structures as arguments to the kernel? (Specifically, how can I create a buffer of these complex…
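Host-side C++ libraries such as GEOS cannot be called from inside an OpenCL kernel; any geometry the kernel needs has to be flattened into buffers of primitive types first, and the algorithm reimplemented in OpenCL C. A minimal sketch of the flattening step, where the `Polygon` struct and its layout are hypothetical stand-ins for illustration, not GEOS types:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side geometry object (a stand-in for a GEOS polygon,
 * not a real GEOS type). */
typedef struct {
    double xs[4];
    double ys[4];
    int    n;      /* number of vertices actually used */
} Polygon;

/* Flatten an array of polygons into a plain double buffer with the layout
 * per polygon: n, x0..x(n-1), y0..y(n-1). A buffer of primitives like this
 * is what clCreateBuffer can accept; the kernel then decodes the layout
 * itself. Returns the number of doubles written. */
size_t flatten(const Polygon *polys, size_t count, double *out) {
    size_t k = 0;
    for (size_t i = 0; i < count; ++i) {
        out[k++] = (double)polys[i].n;
        for (int j = 0; j < polys[i].n; ++j) out[k++] = polys[i].xs[j];
        for (int j = 0; j < polys[i].n; ++j) out[k++] = polys[i].ys[j];
    }
    return k;
}
```

The design constraint this reflects: OpenCL buffer arguments must be contiguous plain data with a layout both sides agree on, so pointer-rich host objects always need an explicit serialization step like this.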

`clEnqueueFillBuffer()` fills a buffer correctly only at random

走远了吗. submitted on 2019-12-10 19:06:12
Question: I'm trying to fill an OpenCL `cl_int2` buffer with default values ( {-1, -2} ); however, the OpenCL function `clEnqueueFillBuffer()` fills my buffer with different values each time I run it – the buffer holds the expected values only at random. The function returns error code 0. Examples of the snippet's output across multiple runs: 0 : -268435456 0 : -2147483648 0 : -536870912 0 : 268435456 0 : 0 0 : -1342177280 -1: -2 I'm running OS X 10.11.6 with a Radeon HD 6750M and OpenCL version 1.2.
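One common cause of this symptom is a `pattern_size` argument that does not match the pattern: for a `cl_int2` pattern it must be `sizeof(cl_int2)` (8 bytes), and the filled region's size must be a multiple of that. A CPU sketch of the semantics the OpenCL 1.2 spec assigns to `clEnqueueFillBuffer()` – the pattern is copied back-to-back until the region is full – with a plain `int[2]` standing in for `cl_int2`:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* CPU model of clEnqueueFillBuffer semantics: the pattern is repeated
 * back-to-back across the region. `size` must be a multiple of
 * `pattern_size`, mirroring the spec's requirement on the real call. */
static void fill_pattern(void *buf, const void *pattern,
                         size_t pattern_size, size_t size) {
    assert(size % pattern_size == 0);
    for (size_t off = 0; off < size; off += pattern_size)
        memcpy((char *)buf + off, pattern, pattern_size);
}
```

With the real API, the corresponding call would be along the lines of `clEnqueueFillBuffer(queue, buf, &pattern, sizeof(cl_int2), 0, n * sizeof(cl_int2), 0, NULL, NULL)`, followed by a `clFinish()` or event wait before reading the buffer back.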

OpenCL/CUDA: Two-stage reduction Algorithm

为君一笑 submitted on 2019-12-10 18:41:41
Question: Reduction of large arrays can be done by calling `__reduce();` multiple times. The following code, however, uses only two stages and is documented here: However, I am unable to understand the algorithm for this two-stage reduction. Can someone give a simpler explanation? __kernel void reduce(__global float* buffer, __local float* scratch, __const int length, __global float* result) { int global_index = get_global_id(0); float accumulator = INFINITY; // Loop sequentially over chunks of input vector…

Is CL_DEVICE_LOCAL_MEM_SIZE for the entire device, or per work-group?

橙三吉。 submitted on 2019-12-10 17:57:58
Question: I'm not quite clear on the actual meaning of CL_DEVICE_LOCAL_MEM_SIZE, which is acquired through the clGetDeviceInfo function. Does this value indicate the total of all the local memory available on a given device, or the upper limit of local memory shared within one work-group? Answer 1: TL;DR: Per single compute unit, hence also the maximum available to one work-group. This value is the amount of local memory available on each compute unit in the device. Since a work-group is assigned to a single…
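Because a work-group executes on a single compute unit, the reported value is the budget that one work-group's `__local` allocations must fit into. A small host-side sketch of that check; the 32 KiB figure in the usage below is only an assumed example value, where on a real device it would come from `clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &local_mem, NULL)`:

```c
#include <assert.h>
#include <stddef.h>

/* Given the device's reported local-memory size (per compute unit, and
 * therefore per work-group), check whether a kernel's __local scratch
 * allocation fits within it. */
static int local_alloc_fits(size_t device_local_mem,
                            size_t scratch_elems, size_t elem_size) {
    return scratch_elems * elem_size <= device_local_mem;
}
```

This is the same arithmetic to apply when choosing a work-group size for a kernel whose `__local` usage scales with the group size.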

Run OpenCL program on NVIDIA hardware

↘锁芯ラ submitted on 2019-12-10 17:45:04
Question: I've built a simple OpenCL-based program (in C++) and tested it on a Windows 8 system with an AMD FirePro V4900 card, using the AMD APP SDK. When I copy my binaries to another machine (Windows 8 with an NVIDIA Quadro 4000 card), I get "The procedure entry point clReleaseDevice could not be located in the dynamic link library (exe of my program)". This second machine has the latest NVIDIA drivers and CUDA 5 installed. Any ideas on what I need to make it work on NVIDIA hardware? Answer 1: It's an…

HD Processor Graphics (HD4000) failed to load as a device in Intel OpenCL SDK

女生的网名这么多〃 submitted on 2019-12-10 17:38:57
Question: I'm using an i7-3770K (Ivy Bridge) with HD 4000 graphics, and I've installed the latest drivers and the newest OpenCL SDK. When I run the code samples on the CPU, they work just fine. However, when I set the '-g' parameter to run on the processor graphics, the device cannot be found, so the sample exits with code -1 (which is likely caused by failing to create a CL context). SimpleOptimization, GodRays, and all samples that support Intel Processor Graphics fail to run on the HD4000. I am using Windows…

Weak guarantees for non-atomic writes on GPUs?

拥有回忆 submitted on 2019-12-10 17:36:30
Question: OpenCL and CUDA have included atomic operations for several years now (although obviously not every CUDA or OpenCL device supports them). But my question is about the possibility of "living with" races due to non-atomic writes. Suppose several threads in a grid all write to the same location in global memory. Are we guaranteed that, when kernel execution has concluded, the result of one of these writes will be present in that location, rather than some junk? Relevant parameters for this…

Is there really a timeout for kernels on NVIDIA GPUs?

感情迁移 submitted on 2019-12-10 16:59:02
Question: While searching for answers to why my kernels produce strange error messages or "0"-only results, I found an answer on SO mentioning that there is a timeout of 5 s for kernels running on NVIDIA GPUs. I googled for the timeout but could not find confirming sources or more information. What do you know about it? Could the timeout cause strange behaviour in kernels with a long runtime? Thanks! Answer 1: Further googling brought this up in the CUDA_Toolkit_Release_Notes_Linux.txt (Known Issues): #…

When writing OpenCL code, how does it perform on a single-core machine without a GPU?

久未见 submitted on 2019-12-10 16:55:05
Question: Hey all, I am currently porting a raytracer from FORTRAN 77 to C for a research project. Having ported the essentials, the question is how we proceed to parallelization. In the lab, I have access to a couple of different Opteron machines with between 2 and 8 cores, but no GPUs (for now). We are running 64-bit Gentoo. A GPGPU version would be (very) desirable, but with only one programmer on the project, maintaining separate non-GPU and GPU versions isn't an option. Also, the code will be…