OpenCL

Install OpenCL(AMD SDK kit) on linux without ROOT privilege

混江龙づ霸主 submitted on 2019-12-05 15:58:11
I am trying to install OpenCL (AMD SDK) on Linux, but I am stuck on the last step (installing the ICD). It seems like the ICD HAS to be installed at /etc/OpenCL/vendors, but I don't have root access to the computer. Is there any way to make OpenCL work without installing the ICD? (Or maybe an environment variable to add a search path for ICD files?) It just seems really inconvenient for people like us when the ICD file path is hardcoded.

Put the ICD files in /some/path/icd and then export the path like so: export OPENCL_VENDOR_PATH=/some/path/icd. It used to work in previous versions at least. I would be surprised
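A sketch of that workaround (all paths here are hypothetical; an .icd file is just a one-line text file containing the full path to the vendor's OpenCL runtime, and OPENCL_VENDOR_PATH is honored by some, but not all, ICD loaders):

```shell
# Create a private vendors directory outside /etc/OpenCL/vendors.
mkdir -p "$HOME/opencl/vendors"

# Write an .icd file pointing at the vendor runtime (hypothetical SDK path).
echo "$HOME/amdappsdk/lib/x86_64/libamdocl64.so" \
    > "$HOME/opencl/vendors/amdocl64.icd"

# Tell the ICD loader to search there instead of the hardcoded default.
export OPENCL_VENDOR_PATH="$HOME/opencl/vendors"
```

Newer versions of the ocl-icd loader use OCL_ICD_VENDORS for the same purpose, so if one variable is ignored it is worth trying the other.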

Is it possible to span an OpenCL kernel to run concurrently on CPU and GPU

拥有回忆 submitted on 2019-12-05 15:21:05
Let's assume that I have a computer with a multicore processor and a GPU. I would like to write an OpenCL program which runs on all cores of the platform. Is this possible, or do I need to choose a single device on which to run the kernel?

In theory yes, you can; the CL API allows it. But the platform/implementation must support it, and I don't think most CL implementations do. To do it, get the cl_device_id of the CPU device and the GPU device, and create a context with those two devices using clCreateContext. No, you can't automagically span a kernel across both CPU and GPU; it's either one

OpenCL computation freezes the screen

大城市里の小女人 submitted on 2019-12-05 14:55:25
As the title says, when I run my OpenCL kernel the entire screen stops redrawing: the image displayed on the monitor remains the same until my program is done with its calculations (this is true even if I unplug the monitor from my notebook and plug it back in; the same image is always displayed), and the computer does not seem to react to mouse movement either, as the cursor stays in the same position. I am not sure why this happens. Could it be a bug in my program, or is this standard behaviour?

While searching on Google I found this thread on AMD's forum, and some people there suggested it's normal as the

Rationalizing what is going on in my simple OpenCL kernel with regard to global memory

泄露秘密 submitted on 2019-12-05 14:52:19
const char programSource[] =
    "__kernel void vecAdd(__global int *a, __global int *b, __global int *c)"
    "{"
    "    int gid = get_global_id(0);"
    "    for (int i = 0; i < 10; i++) {"
    "        a[gid] = b[gid] + c[gid];"
    "    }"
    "}";

The kernel above is a vector addition performed ten times in a loop. I have used the programming guide and Stack Overflow to figure out how global memory works, but I still can't tell by looking at my code whether I am accessing global memory in a good way. I am accessing it in a contiguous fashion and, I am guessing, in an aligned way. Does the card load 128kb chunks of global memory for arrays a, b,

Shouldn't OpenCL matrix multiplication be faster?

安稳与你 submitted on 2019-12-05 14:14:17
I'm trying to learn how to write GPU-optimized OpenCL kernels. I took the example of matrix multiplication using square tiles in local memory. However, I got at best only a ~10x speedup (~50 GFLOPS) compared to numpy.dot() (5 GFLOPS; it uses BLAS). I found studies where they got speedups of >200x (>1000 GFLOPS): ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf. I don't know what I'm doing wrong, or whether it is just because of my GPU (an NVIDIA GTX 275), or because of some PyOpenCL overhead. But I also measured how long it takes just to copy the result from the GPU

clock() in OpenCL

≡放荡痞女 submitted on 2019-12-05 13:52:47
I know that there is a function clock() in CUDA that you can put in kernel code to query the GPU time. But I wonder if such a thing exists in OpenCL? Is there any way to query the GPU time in OpenCL? (I'm using NVIDIA's toolkit.)

The NVIDIA OpenCL SDK has an example, Using Inline PTX with OpenCL. The clock register is accessible through inline PTX as the special register %clock; %clock is described in the PTX: Parallel Thread Execution ISA manual. You should be able to replace the %%laneid with %%clock. I have never tested this with OpenCL but use it in CUDA. Please be warned that the compiler
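Following the pattern of that SDK sample, substituting %%clock for %%laneid would look roughly like this. This is an untested device-code sketch (the kernel name and arguments are invented); the inline-asm extension is NVIDIA-only and will not compile on other vendors' OpenCL implementations:

```c
/* Untested sketch after NVIDIA's "Using Inline PTX with OpenCL" sample. */
__kernel void timed_copy(__global uint *out, __global const uint *in,
                         __global uint *ticks)
{
    uint gid = get_global_id(0);
    uint start, stop;
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(start));
    out[gid] = in[gid];                   /* the work being timed */
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(stop));
    ticks[gid] = stop - start;            /* per-multiprocessor cycle count */
}
```

For portable whole-kernel timing, the standard OpenCL route is to create the command queue with CL_QUEUE_PROFILING_ENABLE and read CL_PROFILING_COMMAND_START/END from the kernel's event via clGetEventProfilingInfo.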

Build OpenCV with OpenCL support

不打扰是莪最后的温柔 submitted on 2019-12-05 12:25:06
In CMake, I built OpenCV with OpenCL enabled. It automatically detected the OPENCL_INCLUDE_DIR path, but OPENCL_LIBRARY was empty, even after clicking Configure; for OPENCL_LIBRARY I don't see a browse button either. After generating the OpenCV binaries, I then run the code below:

#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>

using namespace std;

int main()
{
    if (!cv::ocl::haveOpenCL())
        cout << "OpenCL is not available..." << endl;
    else
        cout << "OpenCL is AVAILABLE! :) " << endl;  // this is the output
    cv::ocl::setUseOpenCL(true);

OpenCL research/academic papers

廉价感情. submitted on 2019-12-05 12:20:27
I'm about to start my honours project at uni on OpenCL and how it can be used to improve modern game development. I know there are a couple of books out now/soon about learning OpenCL, but I was wondering if anyone knows any good papers on OpenCL. I've been looking but can't seem to find any. Part of my project requires a literature review and contrast, so any help on this would be appreciated.

I'll not point you directly to any papers; instead I'll give you a few hints on where to look for them. Google Scholar is one of the best places on the web to search for papers on any subject. Searching for

Programming Intel IGP (e.g. Iris Pro 5200) hardware without OpenCL

 ̄綄美尐妖づ submitted on 2019-12-05 11:36:11
The peak GFLOPS of the cores for the desktop i7-4770K @ 4 GHz is 4 GHz * 8 (AVX) * 4 (FMA) * 4 cores = 512 GFLOPS. But the latest Intel IGP (Iris Pro 5100/5200) has a peak of over 800 GFLOPS. Some algorithms will therefore run even faster on the IGP, and combining the cores with the IGP would be better still. Additionally, the IGP keeps eating up more silicon; the Iris Pro 5100 takes up over 30% of the die now. It seems clear which direction Intel desktop processors are headed. As far as I have seen, however, the Intel IGP is mostly ignored by programmers, with the exception of

Which memory access pattern is more efficient for a cached GPU?

耗尽温柔 submitted on 2019-12-05 10:58:23
So let's say I have a global array of memory:

|a|b|c| |e|f|g| |i|j|k| |

There are four 'threads' (local work-items in OpenCL) accessing this memory, and two possible patterns for this access (columns are time slices, rows are threads):

         0    1    2    3
    t1   a -> b -> c -> .
    t2   e -> f -> g -> .
    t3   i -> j -> k -> .
    t4   .    .    .    .

The above pattern splits the array into blocks, with each thread iterating to and accessing the next element of its block per time slice. I believe this sort of access would work well for CPUs because it maximizes cache locality per thread. Also, loops utilizing this pattern