opencl | 易学教程

How to create OpenCL command queue?

Question: I'm trying to learn OpenCL, but I can't even get a simple kernel to work. The code below comes from the book "OpenCL Programming by Example", which I have modified repeatedly, and I still have no clue what the problem is. Every time I run the program on my PC (AMD Athlon 5350 APU with Radeon R3), it prints the result as "0.0000". If I run the same executable on my other machine (which is a clone of this HD, so everything is the same) with an NVIDIA 1080 Ti, the program outputs "3

OpenCV: Failed to load OpenCL runtime

Question: I am running a program that fails with the error in the title of my question. I found an answer here that suggests downloading OpenCV from GitHub, compiling it with ENABLE_OPENCL=OFF (using CMake), and linking the application against the built libraries. Or is there a way to set this flag in the program itself, without modifying anything in OpenCV? I wonder whether this is possible without having to remove and reinstall OpenCV-3.0. Answer 1: I resolved the problem by running: sudo apt

Matrix multiplication on GPU. Memory bank conflicts and latency hiding

Question: Edit: the achievements over time are listed at the end of this question (~1 TFLOPS so far). I'm writing a kind of math library for C# that uses OpenCL (GPU) from a C++ DLL, and I have already done some optimizations on single-precision square matrix-matrix multiplication (for learning purposes and for possible reuse in a neural-network program later). The kernel code below takes the 1D array v1 as the rows of matrix1 (1024x1024) and the 1D array v2 as the columns of matrix2 (1024x1024, transpose optimization), and puts the result

unroll loops in an AMD OpenCL kernel

Question: I'm trying to assess performance differences in OpenCL on AMD. I have a kernel for the Hough transform that contains two #pragma unroll statements, but running the kernel produces no speedup. kernel void hough_circle(read_only image2d_t imageIn, global int* in,const int w_hough,__global int * circle) { sampler_t sampler=CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST; int gid0 = get_global_id(0); int gid1 = get_global_id(1); uint4 pixel; int x0=0,y0=0,r;

OpenCL error: undefined reference to `_Z12atom_cmpxchgPVU8CLglobalmmm()'

Question: When compiling the following OpenCL kernel: #pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable __kernel void kernel(__global ulong* mem) { atom_cmpxchg(&mem[0], 0, 1); } I get the following error: error: undefined reference to `_Z12atom_cmpxchgPVU8CLglobalmmm()' I'm using OpenCL from Rust with the OCL library. My OpenCL version is 1.2, my GPU is an Intel(R) Iris(TM) Graphics 550, and I'm on macOS Sierra 10.12.1. Answer 1: Check the CL_DEVICE_EXTENSIONS of your device with clGetDeviceInfo()
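
As a rough illustration of that check (not the truncated answer's own code): CL_DEVICE_EXTENSIONS is reported as a space-separated list of extension names, so once the string has been fetched with clGetDeviceInfo() it can be searched for cl_khr_int64_base_atomics. The helper name `has_extension` is hypothetical, and the sketch takes a plain C string as input so that it compiles without an OpenCL installation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if `name` appears as a whole,
 * space-delimited token in `ext_list`. In a real program `ext_list` would be
 * the string filled in by clGetDeviceInfo() for CL_DEVICE_EXTENSIONS; here it
 * is an ordinary C string. `name` is assumed to contain no spaces. */
bool has_extension(const char *ext_list, const char *name)
{
    assert(ext_list != NULL && name != NULL);

    size_t len = strlen(name);
    if (len == 0)
        return false;

    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        /* Accept the hit only if it is bounded by spaces or string ends,
         * so a longer extension name with the same prefix does not match. */
        bool starts_token = (p == ext_list) || (p[-1] == ' ');
        bool ends_token = (p[len] == '\0') || (p[len] == ' ');
        if (starts_token && ends_token)
            return true;
        p += len;
    }
    return false;
}
```

Calling `has_extension(extensions, "cl_khr_int64_base_atomics")` on the queried string would then tell whether the device advertises the 64-bit atomics that the kernel's #pragma asks for.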

OpenCL Video Processing

Question: I'm about to write stacking software. I therefore want to extract the frames of one or more video files into an OpenCL buffer and then process them with an OpenCL kernel. But I don't know how to load the video frames, as I have never worked with video. Since I'm using OpenCL, my main focus is obviously high performance. I know there are libraries such as FFmpeg and OpenCV, among others, but as I'm not familiar with them I don't know which best fits my needs. So can you give me advice which library/function to use which

OpenCL Limit on for loop size?

Question: UPDATE: clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0, LIST_SIZE * sizeof(double), C, 0, NULL, NULL); is returning -5, CL_OUT_OF_RESOURCES. This function call should never return this! I've started using OpenCL and have come across a problem. If I allow a for loop (in the kernel) to run 10000 times, all of C comes back as 0; if I allow the loop to run 8000 times, the results are all correct. I have added waits around the kernel to ensure it completes, thinking I was pulling the data out

A weird Timing issue with “clEnqueueNDRangeKernel” in OpenCL

Question: I'm new to OpenCL and I'm experiencing a weird issue with it. I have a reduction kernel that I run several times. When I profile the kernel execution, the elapsed time (queued->end) stays almost the same, increasing only slightly, but when I measure the elapsed time within the C++ code, the time taken by the "clEnqueueNDRangeKernel" line increases rapidly. I have attached both the code and the profiling output. // execute the kernel globalWorkSize[0] =

How to accumulate vectors in OpenCL?

Question: I have a set of operations running in a loop. for(int i = 0; i < row; i++) { sum += arr1[0] - arr2[0]; sum += arr1[0] - arr2[0]; sum += arr1[0] - arr2[0]; sum += arr1[0] - arr2[0]; arr1 += offset1; arr2 += offset2; } Now I'm trying to vectorize the operations like this: for(int i = 0; i < row; i++) { convert_int4(vload4(0, arr1) - vload4(0, arr2)); arr1 += offset1; arr2 += offset2; } But how do I accumulate the resulting vector in the scalar sum without using a loop? I'm using OpenCL 2.0. Answer 1: The
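
One common pattern for this kind of accumulation, sketched here independently of the truncated answer, is to keep a four-lane accumulator inside the loop and collapse it to a scalar once, after the loop. The sketch is plain C that emulates int4 with a struct so it runs on a host compiler; `int4_t` and `sum_of_differences` are hypothetical names, and it assumes the four scalar statements in the question are meant to cover four consecutive elements, as the vload4 version implies. In an OpenCL kernel the final step would be the component sum `acc.s0 + acc.s1 + acc.s2 + acc.s3`.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C stand-in for OpenCL's int4 (hypothetical type name). */
typedef struct { int x, y, z, w; } int4_t;

/* Sums (arr1[k] - arr2[k]) for k = 0..3 in each row, where row i starts at
 * arr1 + i * offset1 and arr2 + i * offset2. The loop only updates the
 * four lanes; the horizontal reduction happens once at the end. */
int sum_of_differences(const int *arr1, const int *arr2,
                       size_t rows, size_t offset1, size_t offset2)
{
    assert(arr1 != NULL && arr2 != NULL);

    int4_t acc = {0, 0, 0, 0};
    for (size_t i = 0; i < rows; i++) {
        const int *a = arr1 + i * offset1;
        const int *b = arr2 + i * offset2;
        /* Lane-wise accumulate: the analogue of acc += va - vb on int4. */
        acc.x += a[0] - b[0];
        acc.y += a[1] - b[1];
        acc.z += a[2] - b[2];
        acc.w += a[3] - b[3];
    }
    /* Single horizontal sum after the loop, no per-row scalar reduction. */
    return acc.x + acc.y + acc.z + acc.w;
}
```

Deferring the reduction keeps the loop body purely vector-wide, which is the point of vectorizing it in the first place; whether it is actually faster on a given device would need to be measured.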