opencl

Error while reading same mem positions on different threads

Submitted by 不打扰是莪最后的温柔 on 2019-12-12 03:24:37
Question: I have a problem while reading a couple of positions in a double array from different threads. I enqueue the execution with:

    nelements = nx*ny;
    err = clEnqueueNDRangeKernel(queue, kernelTvl2of, 1, NULL, &nelements, NULL, 0, NULL, NULL);

kernelTvl2of has (among others) the code:

    size_t k = get_global_id(0);
    (...)
    u1_[k] = (float)u1[k];
    (...)
    barrier(CLK_GLOBAL_MEM_FENCE);
    forwardgradient(u1_, u1x, u1y, k, nx, ny);
    barrier(CLK_GLOBAL_MEM_FENCE);

and forwardgradient has the code:

    void forwardgradient(global
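One detail worth flagging for this question: barrier(CLK_GLOBAL_MEM_FENCE) only synchronizes work-items within a single work-group, never across the whole NDRange, so values written by one work-group are not guaranteed visible to another after the barrier. The snippet above is cut off, so as a purely hypothetical illustration of why that matters, here is what a forward-difference gradient typically looks like; the neighbour reads u[k+1] and u[k+nx] are the ones that can race across work-groups:

    /* Hypothetical sketch only (the original forwardgradient is truncated):
       a forward-difference gradient reads u[k+1] and u[k+nx], which may have
       been written by work-items in *other* work-groups. */
    void forwardgradient(global const float *u,
                         global float *ux, global float *uy,
                         size_t k, int nx, int ny)
    {
        int x = (int)(k % nx);
        int y = (int)(k / nx);
        ux[k] = (x < nx - 1) ? u[k + 1]  - u[k] : 0.0f;
        uy[k] = (y < ny - 1) ? u[k + nx] - u[k] : 0.0f;
    }

When cross-work-group ordering is required, the portable fix is to split the work into two kernels and enqueue them back to back.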

Concurrent kernel execution not working in AMD A10 APU

Submitted by 大憨熊 on 2019-12-12 02:56:21
Question: I have an AMD A10 APU with a Radeon R7 GPU. I believe this device supports concurrent kernel execution, but when I collected profiling information for the following code, the kernels do not appear to execute concurrently. My OpenCL code is given below (the kernels within each iteration are added to the same queue, and kernels in different iterations are added to different queues, so they should run in parallel).

    for(j = 0; j < 8; j++){
        cl_err = clEnqueueNDRangeKernel(queue[4
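For questions like this, one host-side pattern worth checking is whether each queue is actually submitted to the hardware before the host moves on. A minimal sketch, assuming ctx, dev, and kernel are already set up (all names here are illustrative): with multiple in-order queues, a clFlush per queue gives the runtime a chance to overlap them; a single queue created with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is the other common route.

    #define NQUEUES 8
    cl_command_queue queue[NQUEUES];
    cl_int err;
    for (int j = 0; j < NQUEUES; j++)
        queue[j] = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);

    size_t gws = 1024 * 1024;
    for (int j = 0; j < NQUEUES; j++) {
        err = clEnqueueNDRangeKernel(queue[j], kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
        clFlush(queue[j]);   /* submit this queue's work without blocking */
    }
    for (int j = 0; j < NQUEUES; j++)
        clFinish(queue[j]);  /* wait for all queues at the end */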

Copying global on-device pointer address back and forth between device and host

Submitted by 我们两清 on 2019-12-12 02:49:18
Question: I created a buffer on the OpenCL device (a GPU), and from the host I need to know its global on-device pointer address, so that I can store that address in another buffer; the kernel can then read the first buffer's address out of the second buffer and use it to access the first buffer's contents. If that's confusing, here's what I'm trying to do: I create a generic floats-containing buffer representing a 2D image, then from the host I create a todo list of all
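Plain cl_mem handles deliberately hide the device address, so there is no portable way to extract a raw pointer from one. The standard mechanism for storing device pointers inside buffers is OpenCL 2.0 shared virtual memory; a minimal sketch, assuming a 2.0-capable context ctx and queue q (ntasks, width, height are illustrative):

    /* Coarse-grained SVM: pointer values are meaningful on both host and device. */
    float  *image = (float *) clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                         width * height * sizeof(float), 0);
    float **todo  = (float **)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                         ntasks * sizeof(float *), 0);

    /* Map before touching coarse-grained SVM from the host. */
    clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, todo, ntasks * sizeof(float *), 0, NULL, NULL);
    todo[0] = image;                       /* store a device-usable pointer */
    clEnqueueSVMUnmap(q, todo, 0, NULL, NULL);

    clSetKernelArgSVMPointer(kernel, 0, todo);
    /* SVM regions reached only indirectly (image, via todo[0]) must be declared: */
    clSetKernelExecInfo(kernel, CL_KERNEL_EXEC_INFO_SVM_PTRS, sizeof(image), &image);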

Strange behaviour using local memory in OpenCL

Submitted by 不问归期 on 2019-12-12 01:43:05
Question: I'm currently working on a project using OpenCL on an NVIDIA Tesla C1060 (driver version 195.17). However, I'm getting some strange behaviour I can't really explain. Here is the code which puzzles me (reduced for clarity and testing purposes):

    kernel void TestKernel(global const int* groupOffsets, global float* result,
                           local int* tmpData, const int itemcount)
    {
        unsigned int groupid = get_group_id(0);
        unsigned int globalsize = get_global_size(0);
        unsigned int groupcount = get_num_groups(0);
        for
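One host-side detail that often bites with kernels like this: a local-memory parameter such as tmpData is sized from the host by passing a size with a NULL pointer. A minimal sketch, assuming the buffers and kernel above are already created (localsize is illustrative):

    size_t localsize = 128;   /* elements of tmpData per work-group */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &groupOffsetsBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &resultBuf);
    clSetKernelArg(kernel, 2, localsize * sizeof(cl_int), NULL); /* local int* tmpData */
    clSetKernelArg(kernel, 3, sizeof(cl_int), &itemcount);

Undersizing this argument relative to what the kernel indexes is a classic source of "strange" results, since out-of-bounds local accesses are undefined rather than reported.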

Universal binaries for OpenCL

Submitted by 会有一股神秘感。 on 2019-12-12 01:36:24
Question: I have two computers, one with a Radeon R9 290X and another with a Radeon R7 250. The following discussion concerns only AMD graphics cards. The same driver is installed on both machines. I wrote an OpenCL kernel, compiled it into a binary, and load it with clCreateProgramWithBinary. But I was faced with the following challenges: the compiled binaries for the two devices are different (the R7 binary weighs ~500 KB, the R9 binary ~1.5 MB). I have no problem when using a binary on the device for which it was compiled,
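For reference, the usual way to handle per-device binaries is to treat them as a cache keyed by device rather than as universal artifacts. A minimal sketch, assuming ctx, dev, and the kernel source string src exist (all names illustrative):

    /* Build from source once per device type, then save what the driver returns. */
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);

    size_t binsize;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(binsize), &binsize, NULL);
    unsigned char *bin = malloc(binsize);
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);
    /* Persist `bin` keyed by CL_DEVICE_NAME, and fall back to rebuilding from
       source whenever clCreateProgramWithBinary rejects the cached blob. */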

How many copies of a global variable declared inside an OpenCL kernel function are maintained in the global address space

Submitted by 佐手、 on 2019-12-12 00:42:35
Question: I'm new to OpenCL programming. To learn OpenCL better, after spending some time reading tutorials, I started developing a simple pattern-matching kernel function, but I have some doubts. First, I have global variables declared inside the kernel function. Does that mean every work-item shares a single copy of each variable? Second, how can I use the standard C libraries, especially "string.h"?

    __kernel void matchPatterns_V1(__global char *strings, __global char *patterns, __global int *matchCount
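Two points of fact that bear on this question: a variable declared inside a kernel body without an address-space qualifier lives in private memory, so every work-item gets its own copy (only data reached through __global or __local pointers is shared), and OpenCL C provides no string.h, so string routines must be written by hand. A small illustrative helper (names are mine, not from the question):

    /* OpenCL C has no strncmp; comparisons are written manually. The loop
       counter i is private, so each work-item iterates independently. */
    int match_at(__global const char *text, __global const char *pat, int patlen)
    {
        for (int i = 0; i < patlen; i++)
            if (text[i] != pat[i])
                return 0;
        return 1;   /* pattern matches at this offset */
    }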

OpenCL kernel performing very poorly?

Submitted by 隐身守侯 on 2019-12-12 00:37:24
Question: My application takes 5200 ms to process a data set using OpenCL on the GPU and 330 ms for the same data using OpenCL on the CPU, while the same processing done without OpenCL on the CPU using multiple threads takes 110 ms. The OpenCL timing covers only kernel execution, i.e. it starts just before clEnqueueNDRangeKernel and ends just after clFinish. I have a Windows gadget which tells me that I am only using 19% of the GPU. Even if I could push that to 100%, it would still take ~1000 ms, which is
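Independent of the question's cut-off analysis, wall-clock timing around clEnqueueNDRangeKernel/clFinish also counts submission and launch overhead. Event profiling isolates pure kernel time; a minimal sketch, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE:

    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong t0, t1;   /* device timestamps in nanoseconds */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
    printf("kernel time: %.3f ms\n", (t1 - t0) * 1e-6);
    clReleaseEvent(ev);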

Does OpenCL support a randomly accessed global queue buffer?

Submitted by 夙愿已清 on 2019-12-11 21:31:59
Question: I am writing a kernel which processes combinatorial data. Because these sorts of problems generally have a large problem space, in which most of the processed data is junk, is there a way I could do the following:
(1) if the calculated data passes some sort of condition, it is put onto a global output buffer;
(2) once the output buffer is full, the data is sent back to the host;
(3) the host takes a copy of the data from the buffer and clears it;
(4) then creates a new buffer to be filled by the
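The pattern sketched in (1)-(4) is commonly built around an atomic counter that work-items use to claim slots in the output buffer. A minimal illustrative kernel (the filter condition is a placeholder, not from the question):

    __kernel void filter(__global const int *in,
                         __global int *out,
                         __global volatile int *count,  /* zeroed by the host */
                         int capacity)
    {
        int v = in[get_global_id(0)];
        if (v > 0) {                         /* placeholder pass condition */
            int slot = atomic_inc(count);    /* claim a unique output slot */
            if (slot < capacity)
                out[slot] = v;               /* overflow is simply dropped */
        }
    }

After the kernel finishes, the host reads back the counter to learn how many slots were used, drains the buffer, resets the counter to zero, and re-enqueues.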

Installing Intel SDK for OpenCL Applications setup and got the error below

Submitted by 我的梦境 on 2019-12-11 20:45:48
Question: My laptop's hardware information is as follows:

    OS: Windows 7 Professional Service Pack 1
    CPU: Intel(R) Core(TM) i7-3540M CPU @ 3.00 GHz
    RAM: 16.0 GB
    Graphics: Intel(R) HD Graphics 4000

When I try to run the Intel SDK for OpenCL Applications 2014 setup, it gives the following error. Would you please help me out with this problem? Thanks in advance.

Answer 1: The answer from Dithermaster is correct. I would add a comment to that answer, but currently I cannot, due to the reputation score limitation. Your
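When an OpenCL SDK installer balks, querying what the already-installed runtime reports for the device is a quick sanity check; a minimal sketch, assuming a cl_device_id dev has already been obtained:

    char name[256], ver[256];
    clGetDeviceInfo(dev, CL_DEVICE_NAME,    sizeof(name), name, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_VERSION, sizeof(ver),  ver,  NULL);
    printf("%s: %s\n", name, ver);  /* shows the OpenCL version the driver exposes */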

OpenCL Vector add program

Submitted by Deadly on 2019-12-11 20:04:12
Question: I'm absolutely new to OpenCL programming. I have a working installation of the OpenCL library and drivers, but the program I'm trying to run is not producing the expected output (the output is all zeros). It is just a simple vector-add program. Thanks in advance for suggestions.

    int main(int argc, char** argv) {
        cout << "Hello OpenCL" << endl;
        vector<Platform> all_platforms;
        int err = Platform::get(&all_platforms);
        cout << "Getting Platform ... Error code " << err << endl;
        if (all_platforms.size()==0)
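With the C++ wrapper this question uses, an all-zeros result most often means the result buffer was never read back (or an error code went unchecked). A minimal sketch of the core steps, assuming context, queue, kernel, bufA, bufB, and a std::vector<float> C of size n already exist (all names illustrative):

    cl::Buffer bufC(context, CL_MEM_WRITE_ONLY, n * sizeof(float));
    kernel.setArg(0, bufA);
    kernel.setArg(1, bufB);
    kernel.setArg(2, bufC);
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n), cl::NullRange);
    /* Without this blocking read-back, the host-side vector stays all zeros. */
    queue.enqueueReadBuffer(bufC, CL_TRUE, 0, n * sizeof(float), C.data());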