opencl

OpenCL distribution

Submitted by 我与影子孤独终老i on 2019-12-18 16:55:10
Question: I'm currently developing an OpenCL application for a very heterogeneous set of computers (using JavaCL, to be specific). To maximize performance I want to use a GPU if one is available, and otherwise fall back to the CPU and use SIMD instructions. My plan is to implement the OpenCL code using vector types, because my understanding is that this allows CPUs to vectorize the instructions and use SIMD instructions. My question, however, is which OpenCL implementation to use. E.g
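A minimal sketch of what such a vector-typed kernel could look like (the kernel name and arguments are made up for illustration; this is not from the question):

```c
/* Hypothetical elementwise kernel using OpenCL vector types.
   On a CPU device the compiler can map float4 operations onto
   SSE/AVX lanes; on a GPU the same source still runs, so one
   kernel can serve both the GPU path and the CPU fallback. */
__kernel void scale_add(__global const float4 *a,
                        __global const float4 *b,
                        __global float4 *out,
                        const float factor)
{
    size_t i = get_global_id(0);     /* one work item per float4 */
    out[i] = a[i] * factor + b[i];   /* vector ops, no per-lane loop */
}
```

Note that the host must then launch a global size of N/4 for N floats, and N must be padded to a multiple of 4.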

Using OpenCL accelerated functions with OpenCV3 in Python

Submitted by 不问归期 on 2019-12-18 16:53:40
Question: OpenCV3 introduced its T-API (Transparent API), which lets the user call functions that are accelerated on a GPU (or other OpenCL-enabled device). I'm struggling to find how to tap into that from Python. With C++ there are calls like ocl::setUseOpenCL(true); that enable OpenCL acceleration when you use UMat instead of Mat objects. However, I found no documentation whatsoever for Python. Does anybody have any sample code, links or guides on how to achieve OpenCL acceleration with
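A short sketch of how the T-API is typically reached from Python (requires an OpenCV build with OpenCL support; the file name is a placeholder):

```python
import cv2

# Check and enable the OpenCL backend; if no OpenCL device is
# present, OpenCV silently falls back to the CPU path.
print("OpenCL available:", cv2.ocl.haveOpenCL())
cv2.ocl.setUseOpenCL(True)

img = cv2.imread("input.png")          # placeholder path
uimg = cv2.UMat(img)                   # wrap in UMat -> T-API dispatch

# Ordinary calls now route through OpenCL kernels where available.
blurred = cv2.GaussianBlur(uimg, (7, 7), 1.5)

result = blurred.get()                 # copy back to a NumPy array
```

The key point is that Python mirrors the C++ design: there is no separate `ocl::` function set; passing `cv2.UMat` objects into the normal API is what triggers the accelerated code paths.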

List of OpenCL compliant CPU/GPU

Submitted by 我的未来我决定 on 2019-12-18 12:54:12
Question: How can I know which CPUs can be programmed with OpenCL? For example, the Pentium E5200. Is there a way to know without running and querying it? Answer 1: OpenCL compatibility can generally be determined by looking on the vendors' sites. AMD's APP SDK requires CPUs to support at least SSE2. They also have a list of currently supported ATI/AMD video cards. The most official source is probably the Khronos conformance list: http://www.khronos.org/conformance/adopters/conformant-products#opencl For
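If running code is an option, the definitive check is simply to enumerate what the installed runtime exposes; a sketch (requires OpenCL headers and linking with -lOpenCL, so this complements rather than replaces the vendor lists):

```c
/* Sketch: list every OpenCL device the installed runtime knows about. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, platforms, &nplat);

    for (cl_uint p = 0; p < nplat; ++p) {
        cl_device_id devs[16];
        cl_uint ndev = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devs, &ndev);
        for (cl_uint d = 0; d < ndev; ++d) {
            char name[256];
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof name, name, NULL);
            printf("platform %u: %s\n", p, name);
        }
    }
    return 0;
}
```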

Compiling an OpenCL program using a CL/cl.h file

Submitted by 浪子不回头ぞ on 2019-12-18 12:52:54
Question: I have sample "Hello, World!" code from the net and I want to run it on the GPU on my university's server. When I type "gcc main.c", it responds with: CL/cl.h: No such file or directory What should I do? Where can I get this header file? Answer 1: Make sure you have the appropriate toolkit installed. This depends on what you intend to run your code on. If you have an NVIDIA card, you need to download and install the CUDA toolkit, which also contains the necessary binaries and libraries for
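Once a toolkit is installed, the compile line usually just needs the SDK's include path and the OpenCL library; the paths below are typical defaults and may differ on the server in question:

```shell
# CUDA toolkit layout (NVIDIA); adjust the path to the actual install:
gcc main.c -o hello -I/usr/local/cuda/include -lOpenCL

# If the distribution's opencl-headers package put CL/cl.h on the
# default include path, the -I flag is unnecessary:
gcc main.c -o hello -lOpenCL
```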

Processor Affinity in OpenCL

Submitted by 大兔子大兔子 on 2019-12-18 07:00:37
Question: Can we impose processor affinity in OpenCL? For example, thread #1 executes on processor #5, thread #2 on processor #6, thread #3 on processor #7, and so on? Thanks Answer 1: You can't specify affinity at that low a level with OpenCL, as far as I know. But starting with OpenCL 1.2 you have some control over affinity by partitioning a device into subdevices using clCreateSubDevices (possibly with one processor in each subdevice by using CL_DEVICE_PARTITION_BY_COUNTS, 1) and running separate kernel
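A fragment sketching that partitioning call ('device' is assumed to be a cl_device_id already obtained for the CPU; requires OpenCL 1.2):

```c
/* Partition a CPU device into two subdevices of one compute unit each. */
cl_device_partition_property props[] = {
    CL_DEVICE_PARTITION_BY_COUNTS,
    1, 1,                                   /* two subdevices, 1 CU apiece */
    CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
    0
};
cl_device_id sub[2];
cl_uint nsub = 0;
cl_int err = clCreateSubDevices(device, props, 2, sub, &nsub);
/* Each subdevice can then get its own context and queue, which
   pins the work submitted there to that group of compute units --
   the closest OpenCL comes to per-thread affinity. */
```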

What's the advantage of the local memory in OpenCL?

Submitted by 那年仲夏 on 2019-12-18 05:02:50
Question: I'm wondering what the advantage of local memory is, since each work item can already read global memory separately and freely. Can't we just use global memory? For example, say we have a 1000*1000 image and we want to add 1 to every pixel value. We could do that with 1000*1000 global memory accesses, right? Would it be faster if we used local memory and split the 1000*1000 image into 100 parts of 100*100 each? I'd appreciate it if you could give me a simple example using local memory. Answer 1: Can't we just use the global
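Local memory pays off when work items share data, not for a one-touch operation like "add 1 to every pixel" (there each value is read exactly once, so staging it buys nothing). A hedged 1-D sketch where it does help, a 3-point average in which each input is otherwise read three times from global memory:

```c
#define TILE 64   /* assumed work-group size */

__kernel void avg3(__global const float *in, __global float *out, int n)
{
    __local float tile[TILE + 2];            /* tile plus 1-element halo each side */
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid + 1] = (gid < n) ? in[gid] : 0.0f;
    if (lid == 0)                            /* left halo */
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (lid == TILE - 1)                     /* right halo */
        tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    barrier(CLK_LOCAL_MEM_FENCE);            /* tile fully loaded */

    if (gid < n)
        out[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
}
```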

Are OpenCL work items executed in parallel?

Submitted by 和自甴很熟 on 2019-12-18 04:41:26
Question: I know that work items are grouped into work groups, and you cannot synchronize outside of a work group. Does that mean work items are executed in parallel? If so, is it possible/efficient to have one work group with 128 work items? Answer 1: The work items within a group will be scheduled together, and may run together. It is up to the hardware and/or drivers to choose how parallel the execution actually is. There are different reasons for this, but one very good one is to hide memory
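Rather than hard-coding 128, the runtime can be asked what group size suits a given kernel on a given device; a fragment ('kernel' and 'device' are assumed to exist already):

```c
size_t max_wg = 0, preferred_multiple = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof max_wg, &max_wg, NULL);
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof preferred_multiple, &preferred_multiple, NULL);
/* 128 is a reasonable choice if max_wg >= 128 and 128 is a
   multiple of preferred_multiple (often 32 or 64 on GPUs). */
```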

Access vector type OpenCL

Submitted by 萝らか妹 on 2019-12-18 02:48:20
Question: I have a variable within a kernel like: int16 element; I would like to know if there is a way to address the third int in element as element[2], which would be the same as writing element.s2 So how can I do something like: int16 element; int vector[100] = rand() % 16; for ( int i=0; i<100; i++ ) element[ vector[i] ]++; The way I did it was: int temp[16] = {0}; int16 element; int vector[100] = rand() % 16; for ( int i=0; i<100; i++ ) temp[ vector[i] ]++; element = (int16)(temp[0],temp[1],temp

OpenCL / AMD: Deep Learning [closed]

Submitted by こ雲淡風輕ζ on 2019-12-17 21:38:59
Question: Closed. This question needs to be more focused. It is not currently accepting answers. Closed 11 months ago. While googling and doing some research I was not able to find any serious/popular framework/SDK for scientific GPGPU computing and OpenCL on AMD hardware. Is there any literature and/or software I missed? I am especially interested in deep learning. For all I know

Branch predication on GPU

Submitted by 早过忘川 on 2019-12-17 16:08:03
Question: I have a question about branch predication on GPUs. As far as I know, GPUs handle branches with predication. For example, say I have code like this: if (C) A else B If A takes 40 cycles and B takes 50 cycles to finish, and both A and B are executed within one warp, does it take 90 cycles in total to finish this branch? Or do they overlap A and B, i.e., when some instructions of A are executed, then wait for a memory request, then some instructions of B are executed, then