opencl | 易学教程

Branch predication on GPU

阅读更多关于 Branch predication on GPU

I have a question about branch predication in GPUs. As far as I know, in GPUs, they do predication with branches. For example I have a code like this: if (C) A else B so if A takes 40 cycles and B takes 50 cycles to finish execution, if assuming for one warp, both A and B are executed, so does it take in total 90 cycles to finish this branch? Or do they overlap A and B, i.e., when some instructions of A are executed, then wait for memory request, then some instructions of B are executed, then wait for memory, and so on? Thanks All of the CUDA capable architectures released so far operate like

How can I test for OpenCL compatibility?

阅读更多关于 How can I test for OpenCL compatibility?

I have a MacBook Pro 13' with an integrated Intel HD 3000 and a i7 core. I have to use Parallel Programming. My teaching advisors couldn't tell me if it would work with my MacBook. Is there a test I could run on my Laptop for testing? + I found this, but there is only a Linux and Windows SDK ... maybe the Linux version works also for Mac. What should I do? vocaro's answer is absolutely correct; you can always use the CPU compute device on Snow Leopard and Lion, even if your particular graphics chip doesn't support OpenCL. The following program will show you the OpenCL-capable devices on a

How do I use local memory in OpenCL?

阅读更多关于 How do I use local memory in OpenCL?

I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time. For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something the uses local memory. How would you do it? Here's the original kernel source: __kernel square( __global float *input, __global float *output, const unsigned int count) { int i = get_global_id(0); if (i < count) output[i] = input[i]

How to use OpenCL on Android?

阅读更多关于 How to use OpenCL on Android?

For plattform independence (desktop, cloud, mobile, ...) it would be great to use OpenCL for GPGPU development when speed does matter. I know Google pushes RenderScript as an alternative, but it seems to be only be available for Android and is unlikely to be ever included in iOS. Therefore I seek for a solution to execute OpenCL code within Android Apps. prunge The only Android devices I know that support OpenCL are the ones based on the Mali T600 family of chips (article here ). They have an OpenCL SDK . Apparently it is OpenCL 1.1 full profile as well. The Nexus 10 is a device that uses such

error code (-11):: what are all possible reasons of getting error “cl_build_program_failure” in OpenCL?

阅读更多关于 error code (-11):: what are all possible reasons of getting error “cl_build_program_failure” in OpenCL?

问题 I am using ATI RV770 graphics card, OpenCl 1.0 and ati-stream-sdk-v2.3-lnx64 on linux. While running my host code which includes following two sections to build kernel program, i am getting error code (-11) i.e. cl_build_program_failure . Does it means that kernel program compiled, if not then how is it compiled and debugged? const char* KernelPath = "abc_kernel.cl"; //kernel program is in separate file but in same directory of host code.. / * Create Program object from the kernel source * **

Installing additional files with CMake

阅读更多关于 Installing additional files with CMake

问题 I am attempting to supply some "source" files with some executables. I was wondering if there was a way to copy these source files to the build directory (From the source directory) then to the install directory using CMake. My more specific goal here is to include OpenCL kernels that I write in their own *.cl files. Example: mkdir build cd build cmake .. make Now my directory should have an executable (standard CMake) and some_opencl_kernel.cl which I open in my executable. 回答1: You can copy

ERROR: clGetPlatformIDs -1001 when running OpenCL code (Linux)

阅读更多关于 ERROR: clGetPlatformIDs -1001 when running OpenCL code (Linux)

问题 After finally managing to get my code to compile with OpenCL, I cannot seem to get the output binary to run! This is on my linux laptop running Kubuntu 13.10 x64 The error I get is (Printed from cl::Error): ERROR: clGetPlatformIDs -1001 I found this post but there does not seem to be a clear solution. I added myself to the video group but this does not seem to work. With regards to the ICD profile... I am not sure what I need to do - shouldn't this be included with the cuda toolkit? If not,

Is private memory slower than local memory?

阅读更多关于 Is private memory slower than local memory?

问题 I was working on a kernel which had much global memory access per thread so I copied them to local memory which gave a speed up of 40%. I wanted still more speed up so copied from local to private which degraded the performance So is it correct that I think we must not use to much private memory which may degrade the performance? 回答1: Ashwin's answer is in the right direction but a little misleading. OpenCL abstracts the address space of variables away from their physical storage, and there

How to declare local memory in OpenCL?

阅读更多关于 How to declare local memory in OpenCL?

问题 I'm running the OpenCL kernel below with a two-dimensional global work size of 1000000 x 100 and a local work size of 1 x 100. __kernel void myKernel( const int length, const int height, and a bunch of other parameters) { //declare some local arrays to be shared by all 100 work item in this group __local float LP [length]; __local float LT [height]; __local int bitErrors = 0; __local bool failed = false; //here come my actual computations which utilize the space in LP and LT } This however

Number of Compute Units corresponding to the number of work groups

阅读更多关于 Number of Compute Units corresponding to the number of work groups

问题 I need some clarification. I'm developing OpenCL on my laptop running a small nvidia GPU (310M). When I query the device for CL_DEVICE_MAX_COMPUTE_UNITS , the result is 2. I read the number of work groups for running a kernel should correspond to the number of compute units (Heterogenous Computing with OpenCL, Chapter 9, p. 186), otherwise it would waste too much global memory bandwitdh. Also the chip is specified to have 16 cuda cores (which correspond to PEs I believe). Does that mean