opencl

SIMD intrinsics - are they usable on GPUs?

风格不统一 posted on 2019-12-07 00:50:39
I'm wondering if I can use SIMD intrinsics in GPU code, like a CUDA kernel or an OpenCL one. Is that possible? No, SIMD intrinsics are just thin wrappers around ASM code. They are CPU-specific. More about them here. Generally speaking, why would you do that? CUDA and OpenCL already contain many "functions" which are actually "GPU intrinsics" (all of these, for example, are single-precision math intrinsics for the GPU). You can also use the vector data types built into the OpenCL C language, for example float4 or float8. If you run with the Intel or AMD device drivers, these should get converted to SSE/AVX
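As an illustration of the vector-type approach mentioned above, a minimal OpenCL C kernel using float4 might look like the sketch below (the kernel name and arguments are hypothetical; this source is compiled by the OpenCL driver at run time, not by a host compiler):

```c
// OpenCL C kernel source. Each work-item processes four floats at once;
// on Intel/AMD CPU drivers the float4 operations can map onto SSE/AVX lanes.
__kernel void saxpy4(const float alpha,
                     __global const float4 *x,
                     __global float4 *y)
{
    size_t i = get_global_id(0);
    y[i] = alpha * x[i] + y[i];   // one vector multiply-add per work-item
}
```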

OpenCL FFT on both Nvidia and AMD hardware?

南笙酒味 posted on 2019-12-06 23:30:51
Question: I'm working on a project that needs to use FFTs on both Nvidia and AMD graphics cards. I initially looked for a library that would work on both (thinking this would be the OpenCL way), but I wasn't having any luck. Someone suggested that I would have to use each vendor's FFT implementation and write a wrapper that chose what to do based on the platform. I found AMD's implementation easily enough, but I'm actually working with an Nvidia card in the meantime (and this is the more
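A sketch of the wrapper idea, assuming cuFFT on Nvidia and clFFT on AMD (the backend enum and dispatch function are hypothetical; a real version would query CL_PLATFORM_VENDOR at run time and pass the string in):

```c
#include <string.h>

typedef enum { BACKEND_CUFFT, BACKEND_CLFFT, BACKEND_UNKNOWN } fft_backend;

/* Pick an FFT backend from the OpenCL platform vendor string. */
fft_backend choose_fft_backend(const char *vendor)
{
    if (strstr(vendor, "NVIDIA"))
        return BACKEND_CUFFT;   /* Nvidia: fall back to cuFFT */
    if (strstr(vendor, "Advanced Micro Devices") || strstr(vendor, "AMD"))
        return BACKEND_CLFFT;   /* AMD: use the clFFT library */
    return BACKEND_UNKNOWN;
}
```

The rest of the application then talks to one thin interface (plan, execute, destroy) and only this function knows which vendor library sits behind it.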

OpenCL autocorrelation kernel

ぐ巨炮叔叔 posted on 2019-12-06 16:54:09
Question: I have written a simple program that does autocorrelation as follows; I've used PGI accelerator directives to move the computation to GPUs.

    // autocorrelation
    void autocorr(float *restrict A, float *restrict C, int N) {
        int i, j;
        float sum;
        #pragma acc region
        {
            for (i = 0; i < N; i++) {
                sum = 0.0;
                for (j = 0; j < N; j++) {
                    if ((i + j) < N)
                        sum += A[j] * A[i + j];
                    else
                        continue;
                }
                C[i] = sum;
            }
        }
    }

I wrote a similar program in OpenCL, but I am not getting correct results. The program is as follows
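When an OpenCL port gives wrong results, a plain-C serial reference of the same loop (no accelerator directives) is useful for comparison; a sketch with the same semantics as the code above:

```c
/* Serial reference: C[i] = sum over j of A[j] * A[i+j], for i+j < N. */
void autocorr_ref(const float *A, float *C, int N)
{
    for (int i = 0; i < N; i++) {
        float sum = 0.0f;
        for (int j = 0; i + j < N; j++)   /* same bound as the "(i+j) < N" test */
            sum += A[j] * A[i + j];
        C[i] = sum;
    }
}
```

Comparing the kernel's output against this on a small input usually isolates indexing bugs quickly.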

thrust: fill isolate space

こ雲淡風輕ζ posted on 2019-12-06 16:09:07
Question: I have an array like this:

    0 0 0 1 0 0 0 0 5 0 0 3 0 0 0 8 0 0

I want every non-zero element to expand itself one element at a time until it reaches another non-zero element, so the result looks like this:

    1 1 1 1 1 1 5 5 5 5 3 3 3 3 8 8 8 8

Is there any way to do this using thrust?

Answer 1: Is there any way to do this using thrust? Yes, here is one possible approach. For each position in the sequence, compute 2 distances. The first is the distance to the nearest non-zero value in the left
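The two-distance idea can be prototyped serially before porting it to thrust. A sketch in plain C (sequential, O(N²), with ties going to the left neighbor, which matches the expected output above):

```c
/* For each zero element, copy the nearest non-zero value; on a distance tie,
 * prefer the value to the left. Serial reference, not the parallel version. */
void fill_nearest(const int *in, int *out, int n)
{
    for (int i = 0; i < n; i++) {
        int dl = -1, dr = -1;          /* distances to nearest non-zero, each side */
        for (int d = 0; d < n; d++) {
            if (dl < 0 && i - d >= 0 && in[i - d] != 0) dl = d;
            if (dr < 0 && i + d < n  && in[i + d] != 0) dr = d;
        }
        if (dl >= 0 && (dr < 0 || dl <= dr))
            out[i] = in[i - dl];       /* left is closer (or tied) */
        else
            out[i] = in[i + dr];       /* right is closer */
    }
}
```

In the thrust version, the two distance arrays would be produced by scans and the final pick by a transform, but the per-element decision rule is the same.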

Using clCreateSubBuffer

淺唱寂寞╮ posted on 2019-12-06 15:51:52
Question: I am trying to create a sub-buffer to read in a chunk of the buffer created from a 1-D vector. This is the code I am using:

    d_treeArray = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_uint)*total, NULL, &err);
    cl_buffer_region region;
    region.origin = 0;   // This works
    //region.origin = 4; // This doesn't work
    region.size = 10*sizeof(cl_uint);
    d_subtreeArray = clCreateSubBuffer(d_treeArray, CL_MEM_READ_WRITE, CL_BUFFER_CREATE_TYPE_REGION, &region, &err);
    if (err != CL_SUCCESS) { std::cout <
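Two things commonly go wrong here: region.origin is in bytes (so 4 means byte 4, not element 4), and the origin must be a multiple of the device's base address alignment, which is queried via CL_DEVICE_MEM_BASE_ADDR_ALIGN and reported in bits; otherwise clCreateSubBuffer fails with CL_MISALIGNED_SUB_BUFFER_OFFSET. A sketch of the helper arithmetic, assuming a device that reports 1024-bit (128-byte) alignment:

```c
#include <stddef.h>

/* Round a desired byte offset DOWN to the device's base-address alignment.
 * align_bits is the value queried via CL_DEVICE_MEM_BASE_ADDR_ALIGN.
 * The leftover (desired_bytes - returned origin) can be handed to the
 * kernel as an extra element offset. */
size_t aligned_origin(size_t desired_bytes, unsigned align_bits)
{
    size_t align_bytes = align_bits / 8;
    return (desired_bytes / align_bytes) * align_bytes;
}
```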

Building Tensorflow with OpenCL support fails on Ubuntu 18.04

落花浮王杯 posted on 2019-12-06 14:41:00
Question: While trying to compile TensorFlow on Ubuntu 18.04 with this configuration, I'm running into this error: ERROR: /home/joao/Documents/playground/tensorflow/tensorflow/contrib/tensor_forest/hybrid/BUILD:72:1: C++ compilation of rule '//tensorflow/contrib/tensor_forest/hybrid:utils' failed (Exit 1) In file included from tensorflow/contrib/tensor_forest/hybrid/core/ops/utils.cc:15: In file included from ./tensorflow/contrib/tensor_forest/hybrid/core/ops/utils.h:20: In file included from .

Dynamic global memory allocation in opencl kernel

坚强是说给别人听的谎言 posted on 2019-12-06 13:15:17
Question: Is it possible to dynamically allocate global memory from the kernel? In CUDA it is possible, but I would like to know whether this is also possible in OpenCL on Intel GPUs. For example:

    __kernel void foo() {
        // ... call malloc or clCreateBuffer here ...
    }

Is it possible? If yes, how exactly?

Answer 1: No, this is not currently allowed in OpenCL. You could implement your own heap by creating one very large buffer up front, and then 'allocate' regions of the buffer by handing out offsets (using atomic_add to
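The "one big buffer plus atomic_add" heap described in the answer can be modeled on the host with C11 atomics; in the OpenCL version the counter would be a __global uint bumped with atomic_add from inside the kernel. A minimal bump-allocator sketch (names are illustrative):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Bump allocator over a preallocated arena: each "allocation" is just an
 * atomically reserved offset, so it is safe under concurrent callers
 * (work-items, in the GPU version). Nothing is ever freed individually;
 * the whole arena is reset at once. Returns (size_t)-1 when exhausted. */
typedef struct {
    atomic_size_t head;   /* next free byte offset */
    size_t capacity;      /* arena size in bytes */
} bump_heap;

size_t bump_alloc(bump_heap *h, size_t nbytes)
{
    size_t off = atomic_fetch_add(&h->head, nbytes);
    if (off + nbytes > h->capacity)
        return (size_t)-1;
    return off;
}
```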

Compile and build .cl file using NVIDIA's nvcc Compiler?

僤鯓⒐⒋嵵緔 posted on 2019-12-06 13:05:34
Is it possible to compile a .cl file using NVIDIA's nvcc compiler? I am trying to set up Visual Studio 2010 to write OpenCL code under the CUDA platform, but when I select the CUDA C/C++ Compiler to compile and build the .cl file, it gives me errors like "nvcc does not exist". What is the issue? You should be able to use nvcc to compile OpenCL codes. Normally, I would suggest using a filename extension of .c for a C-compliant code, and .cpp for a C++-compliant code(*); however, nvcc has filename extension override options ( -x ... ) so that we can modify the behavior. Here is a worked example using CUDA 8.0.61,
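Assuming the .cl file contains C-compliant OpenCL *host* code, the override option mentioned above could be used roughly like this (file names are hypothetical; note that nvcc only compiles the host code here, while any kernel source string is still compiled by the OpenCL driver at run time):

```shell
# Tell nvcc to treat the .cl file as plain C, and link the OpenCL runtime.
nvcc -x c -o vecadd vecadd.cl -lOpenCL
```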

Effect of local_work_size on performance and why it is

自古美人都是妖i posted on 2019-12-06 12:24:27
Hello everyone... I am new to OpenCL and trying to explore it further. What is the role of local_work_size in an OpenCL program, and how does it affect performance? I am working on an image processing algorithm, and for my OpenCL kernel I used:

    size_t local_item_size = 1;
    size_t global_item_size = (int) (ceil((float)(D_can_width*D_can_height)/local_item_size))*local_item_size; // Process the entire lists
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);

and for the same kernel, when I changed size_t local_item_size = 16; keeping everything
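The float/ceil round-up in the snippet above can lose precision for large images; the same "pad the global size up to a multiple of the local size" computation can be done in exact integer arithmetic (a common sketch, not specific to any one driver):

```c
#include <stddef.h>

/* Smallest multiple of local_size that is >= items.
 * Equivalent to ceil(items / local_size) * local_size, but exact. */
size_t round_up_global(size_t items, size_t local_size)
{
    return ((items + local_size - 1) / local_size) * local_size;
}
```

The kernel then typically guards with something like `if (get_global_id(0) >= items) return;` so the padded work-items do nothing.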

Mandelbrot set using OpenCL

好久不见. posted on 2019-12-06 12:10:40
Question: I'm trying to use (more or less) the same code that I used when running with TBB (Threading Building Blocks). I don't have a great deal of experience with OpenCL, but I think most of the main code is correct. I believe the errors are in the .cl file, where the math is done. Here is my Mandelbrot code in TBB: Mandelbrot TBB. Here is my code in OpenCL: Mandelbrot OpenCL. Any help would be greatly appreciated.

Answer 1: I changed the code in the kernel, and it ran fine. My new kernel code is the
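The core of any such kernel is the per-pixel escape-time loop. A scalar C version of that loop is shown below for reference (in the OpenCL kernel this would run once per work-item, with cr/ci derived from get_global_id; the function name is illustrative):

```c
/* Escape-time iteration for c = cr + ci*i: iterate z = z^2 + c and return
 * the iteration count at which |z|^2 exceeds 4, or max_iter if it never does. */
int mandel_iters(float cr, float ci, int max_iter)
{
    float zr = 0.0f, zi = 0.0f;
    for (int n = 0; n < max_iter; n++) {
        float zr2 = zr * zr, zi2 = zi * zi;
        if (zr2 + zi2 > 4.0f)
            return n;                    /* escaped: point is outside the set */
        float new_zr = zr2 - zi2 + cr;   /* z^2 + c, real part */
        zi = 2.0f * zr * zi + ci;        /* z^2 + c, imaginary part */
        zr = new_zr;
    }
    return max_iter;                     /* never escaped within the budget */
}
```

A classic bug when porting this to a kernel is updating zr before computing the new zi (losing the old real part), which the temporary new_zr avoids.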