OpenCL

Cannot run OpenCL on an NVIDIA card ('CL/cl_platform.h': No such file or directory)

这一生的挚爱 submitted on 2019-12-11 09:59:17
Question: I am trying to run code written in OpenCL on an NVIDIA GPU. I installed the NVIDIA GPU Computing Toolkit, added the include and lib paths to my project properties, and set the environment variable to the bin path. But I get this error: 'CL/cl_platform.h': No such file or directory. I notice my project is confused between these paths: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\include (which has CL/cl_platform.h, but the project does not see it although I set it in the include properties) and C:\Program
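The usual fix (a sketch, not tied to the asker's exact project) is to make sure the compiler itself actually receives the include path; this error means the `/I` directory never reached the compiler, whatever the IDE page shows. On the command line it would look roughly like:

```bat
@rem Hypothetical MSVC build line; adjust the toolkit version (v4.1 here)
@rem and lib subdirectory (Win32 vs x64) to your installation.
cl /I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\include" main.c ^
   /link /LIBPATH:"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\lib\Win32" OpenCL.lib
```

In Visual Studio the equivalent fields are C/C++ → Additional Include Directories and Linker → Additional Library Directories; check they are set for the active configuration and platform, not just one of them.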

Using “String” in OpenCL Kernel

寵の児 submitted on 2019-12-11 09:37:11
Question: I have a question about OpenCL programming. The scenario is: I have a list of words of different lengths taken from a file, and I have to pass this list to an OpenCL kernel. I tried to use a struct composed of an array of char that contains the word and an int that contains its size. But this solution doesn't work because, in the kernel, I must create a new array with the size indicated in the struct, and the kernel doesn't allow arrays of variable size. Is there a way to implement

Non-recursive random number generator

筅森魡賤 submitted on 2019-12-11 09:01:34
Question: I have searched for pseudo-RNG algorithms, but all I can find seem to generate the next number by using the previous result as the seed. Is there a way to generate them non-recursively? The scenario where I need this is concurrent OpenCL programming: each thread/pixel needs an independent RNG. I tried to seed them using BIG_NUMBER + work_id, but the result has a strong visual pattern in it. I tried several different RNG algorithms and all have this problem. Apparently they only guarantee

Elementwise operations in OpenCL (CUDA)

独自空忆成欢 submitted on 2019-12-11 08:57:24
Question: I built a kernel for elementwise multiplication of two matrices, but at least with my configuration my OpenCL kernel is only faster when each matrix is larger than 2GB. So I was wondering whether that is because of my naive kernel (see below) or because of the nature of elementwise operations, meaning that elementwise operations don't gain from using GPUs. Thanks for your input! Kernel: KERNEL_CODE = """ // elementwise multiplication: C = A .* B. __kernel void matrixMul( __global float* C, _

OpenCL kernel - every work-item overwrites global memory?

て烟熏妆下的殇ゞ submitted on 2019-12-11 08:56:17
Question: I'm trying to write a kernel to get the character frequencies of a string. First, here is the code I have for the kernel right now: __kernel void readParallel(__global char * indata, __global int * outdata) { int startId = get_global_id(0) * 8; int maxId = startId + 7; for (int i = startId; i < maxId; i++) { ++outdata[indata[i]]; } } The variable indata holds the string in global memory, and outdata is an array of 256 int values in global memory. Every work-item reads 8 symbols from the

How to write a multiline errorformat string?

不想你离开。 submitted on 2019-12-11 08:43:00
Question: I want to write an OpenCL syntax checker for the vim-opencl plugin. The OpenCL compiler does some strange formatting of its output errors. There are two types of errors. Normal (with a small error explanation): "/tmp/OCLUKvOsF.cl", line 143: error: expression must have integral type rec_table[PRIME_P - ri] = PRIME_P - i; ^ And abnormal, with a line break in the error explanation: "/tmp/OCLUKvOsF.cl", line 148: error: a value of type "uint16" cannot be used to initialize an entity of type "uint" uint a = value, b =
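A hedged sketch of the Vim side (untested against every compiler's output, patterns are my assumption): Vim's 'errorformat' handles multiline messages with %E (start of error), %C (continuation) and %Z (end) prefixes, which covers both shapes above since the wrapped explanation simply becomes an extra continuation line:

```vim
" Hypothetical errorformat sketch; \, escapes the literal comma after "%f".
" %E starts a multiline error, %C appends continuation text to the message,
" and %Z%p^ uses the caret line to end the entry and set the column.
let &l:errorformat = join([
      \ '%E"%f"\, line %l: error: %m',
      \ '%C%m',
      \ '%Z%p^',
      \ ], ',')
```

Because the caret line terminates the entry, the one-line and two-line explanations need no separate patterns; see :help errorformat-multi-line for the exact semantics.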

Reduction of matrix rows in OpenCL

半腔热情 submitted on 2019-12-11 06:28:55
Question: I have a matrix which is stored as a 1D array on the GPU, and I'm trying to write an OpenCL kernel which will use reduction on every row of this matrix. For example, if my matrix is 2x3 with the elements [1, 2, 3, 4, 5, 6], what I want is the row sums: [1, 2, 3] → [6] and [4, 5, 6] → [15]. Obviously, since I'm talking about reduction, the actual result could have more than one element per row: [1, 2, 3] → [3, 3] and [4, 5, 6] → [9, 6]. Then the final calculation I can do in another kernel or on the CPU. Well,

GPGPU: Consequence of having a common PC in a warp

南笙酒味 submitted on 2019-12-11 05:49:21
Question: I read in a book that in a wavefront or warp, all threads share a common program counter. So what is its consequence? Why does that matter? Answer 1: NVIDIA GPUs execute 32 threads at a time (warps) and AMD GPUs execute 64 threads at a time (wavefronts). Sharing control logic, fetch, and data paths reduces area and increases perf/area and perf/watt. To take advantage of this design, programming languages and developers need to understand how to coalesce memory accesses and how to

SVO Rendering: OpenGL or Custom renderer?

北城以北 submitted on 2019-12-11 05:26:11
Question: I am planning on creating a Sparse Voxel Octree (SVO) engine and am torn between using OpenGL to render each little cube or writing my own renderer in assembly and C. If I went with the latter, I am unsure how to draw pixels to the screen (I'm on a Mac, 10.8). What graphics context/windowing system would be the preferred method for this (not X; I have my share of issues with X on my Mac)? P.S. the engine will need to be able to draw a minimum of 50,000 cubes (I will use OpenCL/CUDA

OpenCL performance measurement

吃可爱长大的小学妹 submitted on 2019-12-11 05:22:10
Question: What is the most appropriate method to present the performance of an OpenCL application (especially its compute kernels)? I have implemented some algorithms and was thinking about presenting speed-up and efficiency charts, but by definition I would need to know how many processors were used in the calculations, and in the case of OpenCL that cannot be determined. Answer 1: Create your command queue with the CL_QUEUE_PROFILING_ENABLE flag set, then use clGetEventProfilingInfo to extract timing data. See Chapter