OpenCL

Strategy for doing final reduction

被刻印的时光 ゝ Submitted on 2019-12-19 16:26:10
Question: I am trying to implement an OpenCL version for doing a reduction of an array of floats. To achieve it, I took the following code snippet found on the web: __kernel void sumGPU ( __global const double *input, __global double *partialSums, __local double *localSums) { uint local_id = get_local_id(0); uint group_size = get_local_size(0); // Copy from global memory to local memory localSums[local_id] = input[get_global_id(0)]; // Loop for computing localSums for (uint stride = group_size/2; stride>0
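For reference, a minimal sketch of the work-group reduction this excerpt appears to be building (argument names follow the excerpt; everything after the truncation point, including the final write of the partial sum, is an assumption about how such kernels are usually completed):

    __kernel void sumGPU(__global const double *input,
                         __global double *partialSums,
                         __local double *localSums)
    {
        uint local_id   = get_local_id(0);
        uint group_size = get_local_size(0);

        // Copy one element per work-item from global to local memory.
        localSums[local_id] = input[get_global_id(0)];

        // Tree reduction inside the work-group: halve the stride each pass.
        for (uint stride = group_size / 2; stride > 0; stride /= 2) {
            barrier(CLK_LOCAL_MEM_FENCE);          // wait for the previous pass
            if (local_id < stride)
                localSums[local_id] += localSums[local_id + stride];
        }

        // Work-item 0 publishes this group's partial sum; the "final reduction"
        // the title asks about is then done over partialSums, either on the
        // host or with a second, much smaller kernel launch.
        if (local_id == 0)
            partialSums[get_group_id(0)] = localSums[0];
    }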

OpenCL reduction result wrong with large floats

我怕爱的太早我们不能终老 Submitted on 2019-12-19 11:56:54
Question: I used AMD's two-stage reduction example to compute the sum of all numbers from 0 to 65 536 using floating point precision. Unfortunately, the result is not correct. However, when I modify my code so that I compute the sum of 65 536 smaller numbers (for example 1), the result is correct. I couldn't find any error in the code. Is it possible that I am getting wrong results because of the float type? If this is the case, what is the best approach to solve the issue? Answer 1: There is probably no
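The precision effect the question describes is easy to reproduce without OpenCL at all: once a float accumulator grows past 2^24 it can no longer represent every integer exactly, so naively summing the 65 536 values 0, 1, ..., 65535 drifts away from the exact total 2 147 450 880, while summing 65 536 ones stays exact. A host-side C sketch (not taken from the question's code) that also shows the usual fixes, a double accumulator or Kahan compensated summation:

    #include <stdio.h>

    int main(void)
    {
        float  naive = 0.0f;            /* plain float accumulator           */
        float  kahan = 0.0f, c = 0.0f;  /* compensated (Kahan) accumulator   */
        double exact = 0.0;             /* double accumulator for comparison */

        for (int i = 0; i < 65536; ++i) {
            naive += (float)i;

            /* Kahan summation: carry the rounding error of each addition. */
            float y = (float)i - c;
            float t = kahan + y;
            c = (t - kahan) - y;
            kahan = t;

            exact += i;
        }

        printf("naive float : %.1f\n", naive);
        printf("kahan float : %.1f\n", kahan);
        printf("double      : %.1f\n", exact);  /* exact sum is 2147450880 */
        return 0;
    }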

OpenCL float sum reduction

▼魔方 西西 Submitted on 2019-12-19 07:36:27
Question: I would like to apply a reduce on this piece of my kernel code (1 dimensional data): __local float sum = 0; int i; for(i = 0; i < length; i++) sum += //some operation depending on i here; Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end have 1 thread make the total sum. In pseudo code, I would like to be able to write something like this: int i = get_global_id(0); __local float sum = 0; sum += //some operation
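One way to get exactly that shape (every work-item contributes one term, one result per group) is the local-memory tree reduction sketched under the first question above; on OpenCL 2.0 devices the same thing can be written more compactly with the built-in work-group reduction. A sketch, where op(i) is a hypothetical placeholder for "some operation depending on i":

    // Requires OpenCL C 2.0 for work_group_reduce_add.
    inline float op(int i) { return (float)i; }   // placeholder operation

    __kernel void sum_per_group(__global float *groupSums)
    {
        int i = get_global_id(0);

        // Each work-item computes its own term...
        float term = op(i);

        // ...and the built-in reduces all terms of the work-group.
        float sum = work_group_reduce_add(term);

        // One write per work-group; the partial sums in groupSums are then
        // added up on the host or by a second kernel.
        if (get_local_id(0) == 0)
            groupSums[get_group_id(0)] = sum;
    }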

Passing struct with pointer members to OpenCL kernel using PyOpenCL

半世苍凉 Submitted on 2019-12-19 03:58:38
Question: Let's suppose I have a kernel to compute the element-wise sum of two arrays. Rather than passing a, b, and c as three parameters, I make them structure members as follows: typedef struct { __global uint *a; __global uint *b; __global uint *c; } SumParameters; __kernel void compute_sum(__global SumParameters *params) { uint id = get_global_id(0); params->c[id] = params->a[id] + params->b[id]; return; } There is information on structures if you RTFM of PyOpenCL [1], and others have addressed
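The stumbling block here is that buffer addresses written into a struct on the host are meaningless on the device, so a struct containing __global pointers cannot simply be copied over. The conventional workaround (a sketch of the usual pattern, not taken from the question or from PyOpenCL's documentation) is to pass each buffer as its own kernel argument and keep only plain scalars in any struct that is transferred:

    __kernel void compute_sum(__global const uint *a,
                              __global const uint *b,
                              __global uint *c)
    {
        uint id = get_global_id(0);
        // Same element-wise sum the struct version was aiming for.
        c[id] = a[id] + b[id];
    }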

Passing struct to GPU with OpenCL that contains an array of floats

懵懂的女人 Submitted on 2019-12-18 18:28:36
Question: I currently have some data that I would like to pass to my GPU and then multiply by 2. I have created a struct which can be seen here: struct GPUPatternData { cl_int nInput,nOutput,patternCount, offest; cl_float* patterns; }; This struct should contain an array of floats. The array of floats will not be known until run time, as it is specified by the user. The host code: typedef struct GPUPatternDataContatiner { int nodeInput,nodeOutput,patternCount, offest; float* patterns; } GPUPatternData;
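As in the previous question, the cl_float* member cannot travel to the device inside the struct, because the pointer has no meaning there. A common pattern (a sketch of one conventional layout, not the asker's final code; the struct keeps the original field names, misspellings included) is to transfer the fixed-size scalar fields and the variable-length float array as two separate buffers:

    // Device-side mirror of the scalar fields only.
    typedef struct {
        int nInput, nOutput, patternCount, offest;
    } GPUPatternHeader;

    __kernel void double_patterns(__global const GPUPatternHeader *header,
                                  __global float *patterns,
                                  const uint count)   // total number of floats
    {
        uint id = get_global_id(0);
        // Multiply each stored value by 2, as the question describes; header
        // is available for whatever indexing its offest/count fields drive.
        if (id < count)
            patterns[id] *= 2.0f;
    }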

OpenCL: Store pointer to global memory in local memory?

狂风中的少年 Submitted on 2019-12-18 17:58:43
Question: Any solutions? Is that even possible? __global float *abc; // pointer to global memory stored in private memory I want abc to be stored in local memory instead of private memory. Answer 1: I think this is clarified here, List 5.2: __global int global_data[128]; // 128 integers allocated on global memory __local float *lf; // pointer placed on the private memory, which points to a single-precision float located on the local memory __global char * __local lgc[8]; // 8 pointers stored on the local
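Spelling out the rule the quoted listing illustrates: the address-space qualifier before the * describes the memory being pointed to, while a qualifier after the * describes where the pointer variable itself lives. A sketch of a pointer to global memory that is itself stored in local memory (the kernel name and final read are illustrative; note that __local variables cannot be initialized at declaration, so one work-item assigns the pointer and a barrier publishes it):

    __kernel void pointer_spaces(__global float *data)
    {
        // Pointer variable in private memory, pointing into global memory
        // (this is what the question's plain declaration gives you).
        __global float *private_ptr = data;

        // Pointer variable in LOCAL memory, pointing into global memory.
        __global float * __local shared_ptr;

        if (get_local_id(0) == 0)
            shared_ptr = private_ptr;      // one work-item sets it up
        barrier(CLK_LOCAL_MEM_FENCE);      // make it visible to the group

        // Every work-item in the group can now read through the shared pointer.
        data[get_global_id(0)] = shared_ptr[get_global_id(0)] * 2.0f;
    }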

GCC: Compiling an OpenCL host on Windows

北城以北 Submitted on 2019-12-18 17:29:12
Question: I just wanted to try out using OpenCL under Windows. Abstract: I got an "undefined reference to" error when I tried to compile (using the command gcc my.o -o my.exe -L "C:\Program Files (x86)\AMD APP\lib\x86_64" -l OpenCL ). My Code: #include <CL/cl.h> #include <stdio.h> #include <stdlib.h> int main(void) { cl_platform_id platform; int err; err = clGetPlatformIDs(1, &platform, NULL); if(err < 0) { perror("There's No Platform!"); exit(1); } /* Some more code... */ system("PAUSE"); } Makefile all: addition
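An "undefined reference" at this point is a link-time problem, not a compile-time one; the usual suspects are the -L path and a mismatch between the compiler's bitness and the x86 vs x86_64 library directory. For reference, a minimal host program of this shape that should link once the toolchain actually sees a matching OpenCL import library (the path in the comment is the one from the question):

    /* Typical MinGW link line (GCC bitness must match x86 vs x86_64):
     *   gcc host.c -o host.exe -L"C:\Program Files (x86)\AMD APP\lib\x86_64" -lOpenCL
     */
    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>     /* exit() */

    int main(void)
    {
        cl_platform_id platform;
        cl_int err = clGetPlatformIDs(1, &platform, NULL);
        if (err < 0) {
            perror("There's No Platform!");
            exit(1);
        }
        printf("Found an OpenCL platform.\n");
        return 0;
    }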

C++ Template preprocessor tool

邮差的信 Submitted on 2019-12-18 16:57:06
Question: Is there a compiler or standalone preprocessor which takes C++ files and runs a template expansion pass, generating new C++ code with expanded template instantiations? I remember such a tool from the mid-90s, when templates were still new and experimental, and the preprocessor was a way to do template programming with compilers without native template support. This is a lot more complicated than a macro-processing step, since it would likely require parsing and tokenizing the code to understand

OpenCL distribution

耗尽温柔 Submitted on 2019-12-18 16:55:17
Question: I'm currently developing an OpenCL application for a very heterogeneous set of computers (using JavaCL, to be specific). In order to maximize performance I want to use a GPU if one is available, and otherwise fall back to the CPU and use SIMD instructions. My plan is to implement the OpenCL code using vector types, because my understanding is that this allows CPUs to vectorize the instructions and use SIMD instructions. My question, however, is regarding which OpenCL implementation to use. E.g
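To make the vector-type plan concrete, this is the kind of kernel it implies: explicit float4 arithmetic that a CPU OpenCL implementation can map onto SSE/AVX lanes, while a GPU implementation simply executes it as four scalar lanes (buffer names and the operation are illustrative only):

    // Each work-item handles one float4, i.e. four packed floats.
    __kernel void saxpy_vec4(__global const float4 *x,
                             __global const float4 *y,
                             __global float4 *out,
                             const float a)
    {
        size_t i = get_global_id(0);
        // One vector multiply-add per work-item; a CPU implementation can
        // turn this into a single SIMD instruction.
        out[i] = a * x[i] + y[i];
    }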