opencl

Is there a way I can create an array of cl::sycl::pipe?

Submitted by 二次信任 on 2019-12-24 09:51:46
Question: I am using Xilinx's triSYCL GitHub implementation, https://github.com/triSYCL/triSYCL. I am trying to create a design with 100 cl::sycl::pipe objects, each with capacity = 6, and I am going to access each pipe through a separate thread in my SYCL code. Here is what I tried:

constexpr int T = 6;
constexpr int n_threads = 100;
cl::sycl::pipe<cl::sycl::pipe<float>> p { n_threads, cl::sycl::pipe<float> { T } };
for (int j = 0; j < n_threads; j++) {
  q.submit([&](cl::sycl::handler &cgh) {
    // Get write access …
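The nested cl::sycl::pipe<cl::sycl::pipe<float>> above makes the pipe's element type another pipe, which is probably not what is wanted; a pipe's type parameter should be the data flowing through it. A minimal sketch of one alternative, assuming triSYCL's cl::sycl::pipe<float> can be stored in a std::vector (the header path and constructor form follow the excerpt, not a verified triSYCL API):

#include <vector>
#include <CL/sycl.hpp>

constexpr int T = 6;           // capacity of each pipe
constexpr int n_threads = 100; // one pipe per thread

int main() {
  // 100 independent pipes, each with capacity T, instead of one pipe
  // whose element type is itself a pipe.
  std::vector<cl::sycl::pipe<float>> pipes;
  pipes.reserve(n_threads);
  for (int j = 0; j < n_threads; ++j)
    pipes.emplace_back(T);
  // Each thread j would then obtain its write/read accessors from
  // pipes[j] inside its own command-group submission.
}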

How to avoid constant memory copying in OpenCL

Submitted by Deadly on 2019-12-24 08:35:18
Question: I wrote a C++ application that simulates simple heat flow, using OpenCL for the computation. The OpenCL kernel takes a two-dimensional (n x n) array of temperature values and its size (n), and returns a new array with the temperatures after each cycle. Pseudocode:

int t_id = get_global_id(0);
if (t_id < n * n) {
    m_new[t_id / n][t_id % n] = average of its own and its neighbors' (top, bottom, left, right) temperatures
}

As you can see, every thread computes a single cell of the matrix. When the host …
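A common way to avoid the per-cycle copying the title asks about is double buffering on the device: allocate two buffers, swap which one is "old" and which is "new" each cycle, and read back only once at the end. A minimal host-side sketch against the C API; ctx, queue, kernel, n, cycles, and host are hypothetical names assumed to exist:

// Two device buffers that trade roles every cycle; no host round-trips.
cl_int err;
cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * n * sizeof(float), NULL, &err);
cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * n * sizeof(float), NULL, &err);
size_t global = (size_t)n * n;

for (int cycle = 0; cycle < cycles; ++cycle) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);   // m_old
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);   // m_new
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    cl_mem tmp = bufA; bufA = bufB; bufB = tmp;         // swap roles, no copy
}
// One read-back after all cycles instead of one per cycle.
clEnqueueReadBuffer(queue, bufA, CL_TRUE, 0, n * n * sizeof(float), host, 0, NULL, NULL);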

OpenCL: comparing the time required to add two arrays of integers on available platforms/devices

Submitted by 末鹿安然 on 2019-12-24 07:29:18
Question: I'm very new to the whole OpenCL world, so I'm following some beginner tutorials. I'm trying to combine this and this to compare the time required to add two arrays together on different devices. However, I'm getting confusing results. Considering that the code is too long, I made this GitHub Gist. On my Mac I have 1 platform with 3 devices. When I manually assign the j in

cl_command_queue command_queue = clCreateCommandQueue(context, device_id[j], 0, &ret);

to 0, it seems to run the …
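One pattern that makes such comparisons meaningful: give each device its own queue and block until the work has finished before stopping the timer, otherwise only the enqueue call is measured. A sketch of that loop; context, device_id, num_devices, add_kernel, and global are assumed from the surrounding code:

#include <stdio.h>
#include <time.h>

for (cl_uint j = 0; j < num_devices; ++j) {
    cl_int ret;
    cl_command_queue q = clCreateCommandQueue(context, device_id[j], 0, &ret);

    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    clEnqueueNDRangeKernel(q, add_kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(q);   // wait, so the host timer covers completed device work
    timespec_get(&t1, TIME_UTC);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) * 1e-6;
    printf("device %u: %.3f ms\n", j, ms);
    clReleaseCommandQueue(q);
}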

Read/Write OpenCL memory buffers on multiple GPU in a single context

Submitted by [亡魂溺海] on 2019-12-24 06:35:16
Question: Assume a system with two distinct GPUs from the same vendor, so they can be accessed from a single OpenCL platform. Given the following simplified OpenCL code:

float* someRawData;
cl_device_id gpu1 = clGetDeviceIDs(0,...);
cl_device_id gpu2 = clGetDeviceIDs(1,...);
cl_context ctx = clCreateContext(gpu1,gpu2,...);
cl_command_queue queue1 = clCreateCommandQueue(ctx,gpu1,...);
cl_command_queue queue2 = clCreateCommandQueue(ctx,gpu2,...);
cl_mem gpuMem = clCreateBuffer(ctx, CL_MEM_READ_WRITE, …
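With both devices in one context, the buffer object itself is shared; what needs care is the ordering between the two queues. A sketch of one way to write on gpu1's queue and consume on gpu2's queue, using an event so the runtime can synchronize (and migrate) the data; kernelOnGpu2, size, and global are hypothetical:

// Write through queue1, then make queue2's kernel wait on that write.
cl_event written;
clEnqueueWriteBuffer(queue1, gpuMem, CL_FALSE, 0, size, someRawData,
                     0, NULL, &written);
clSetKernelArg(kernelOnGpu2, 0, sizeof(cl_mem), &gpuMem);
clEnqueueNDRangeKernel(queue2, kernelOnGpu2, 1, NULL, &global, NULL,
                       1, &written, NULL);  // event dependency across queues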

Is there an OpenCL profiler for Mac OS X 10.8?

Submitted by 有些话、适合烂在心里 on 2019-12-24 03:56:08
Question: I am trying to find the bottleneck in my OpenCL kernel. Is it possible to profile OpenCL programs on Mac OS X? I found gDebugger at http://www.gremedy.com/, but it requires 10.5 or 10.6 to run. The AMD SDK supports only Linux and Windows. Is there a profiler for Mountain Lion? Answer 1: How detailed must your profiling information be? Is it okay to use the built-in internal profiler? OpenCL queues can be created with the CL_QUEUE_PROFILING_ENABLE flag. This way you can see, for each kernel you …
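A sketch of the built-in profiling the answer describes: create the queue with CL_QUEUE_PROFILING_ENABLE, attach an event to each enqueue, and read the start/end timestamps; ctx, dev, kernel, and global are assumed to exist:

#include <stdio.h>

cl_int err;
cl_command_queue q = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);
cl_event ev;
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong t0, t1;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
printf("kernel took %.3f ms\n", (t1 - t0) * 1e-6);  // timestamps are in ns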

Different ways to optimize Python code with GPU/PyOpenCL: extern function inside a GPU kernel

Submitted by 天涯浪子 on 2019-12-24 03:48:09
Question: I used the following command to profile my Python code:

python2.7 -m cProfile -o X2_non_flat_multiprocessing_dummy.prof X2_non_flat.py

Then I can visualize the overall distribution of time across the different greedy functions. [profiler visualization not reproduced here] As you can see, a lot of time is spent in Pobs_C and the interpolate routine, which correspond to the following code snippet:

def Pobs_C(z, zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T, R_T, DG_T_fid, DG_T, WGT_T, WT_T, WIAT_T, cl, P_dd_spec, RT500):
    cc …
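On the OpenCL side, the "extern function inside kernel" part usually comes down to this: a kernel cannot call back into host Python code, but a helper function compiled into the same OpenCL program can be called directly. A hypothetical OpenCL C sketch of moving a 1-D linear interpolation into the kernel (illustrative only, not the question's actual Pobs_C, whose body is truncated above):

// Helper defined in the same program source as the kernel.
float lerp1d(float x, __global const float* xs, __global const float* ys, int n) {
    // xs assumed ascending; clamp outside the table, interpolate inside.
    if (x <= xs[0])     return ys[0];
    if (x >= xs[n - 1]) return ys[n - 1];
    int i = 1;
    while (xs[i] < x) ++i;                      // first index with xs[i] >= x
    float t = (x - xs[i - 1]) / (xs[i] - xs[i - 1]);
    return ys[i - 1] + t * (ys[i] - ys[i - 1]);
}

__kernel void interp_all(__global const float* z, __global const float* xs,
                         __global const float* ys, __global float* out, int n) {
    int gid = get_global_id(0);
    out[gid] = lerp1d(z[gid], xs, ys, n);       // one interpolation per work-item
}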

“Compile Server Error.” while building OpenCL kernels

Submitted by 久未见 on 2019-12-24 03:44:07
Question: I am trying to compile OpenCL kernels on OS X. Everything is fine when there are just a few lines. However, once the code grows past about 1.5k lines, clGetProgramBuildInfo with the CL_PROGRAM_BUILD_LOG flag returns "Compile Server Error." every time. I googled but found nothing about it. Could anyone help me? Answer 1: You can learn the meaning of OpenCL error codes by searching in cl.h. In this case, -11 is just what you'd expect, CL_BUILD_PROGRAM_FAILURE. It's certainly curious that the build log is …
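When the compiler front end fails like this, the return code of clBuildProgram plus the full build log (when one exists) is still worth dumping; a minimal C sketch, assuming program and device are already set up:

#include <stdio.h>
#include <stdlib.h>

if (clBuildProgram(program, 1, &device, "", NULL, NULL) != CL_SUCCESS) {
    size_t len = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
    char *log = (char *)malloc(len);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, len, log, NULL);
    fprintf(stderr, "build log:\n%s\n", log);   // may itself just say "Compile Server Error."
    free(log);
}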

What is the optimum OpenCL 2 kernel to sum floats?

Submitted by 空扰寡人 on 2019-12-24 03:32:36
Question: C++17 introduced a number of new algorithms to support parallel execution; in particular, std::reduce is a parallel version of std::accumulate that permits non-deterministic results for non-associative operations such as floating-point addition. I want to implement a reduce algorithm using OpenCL 2. Intel has an example here which uses OpenCL 2 work-group kernel functions to implement a std::exclusive_scan OpenCL 2 kernel. Below is a kernel to sum floats, based on Intel's exclusive_scan …
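For reference, the shortest correct OpenCL 2.0 form (not necessarily the optimum the title asks for) uses the built-in work-group reduction: each work-group emits one partial sum, and a second pass or the host adds those up. Build with -cl-std=CL2.0:

// One partial sum per work-group via the OpenCL 2.0 built-in.
__kernel void sum_floats(__global const float* in,
                         __global float* partial_sums) {
    float v = in[get_global_id(0)];
    float group_sum = work_group_reduce_add(v);
    if (get_local_id(0) == 0)
        partial_sums[get_group_id(0)] = group_sum;
}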

OpenCL: no synchronization despite barrier

Submitted by 不问归期 on 2019-12-24 03:09:27
Question: I just started to use OpenCL via the PyOpenCL interface from Python. I tried to create a very simple "recurrent" program where the outcome of each loop iteration in every kernel depends on the output of another kernel from the previous cycle, but I am running into synchronization problems:

__kernel void part1(__global float* a, __global float* c) {
    unsigned int i = get_global_id(0);
    c[i] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (i < 9) {
        for (int t = 0; t < 2; t++) {
            c[i] = c[i+1] + a[i];
            barrier(CLK …
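The underlying issue: barrier() only synchronizes work-items within one work-group; there is no in-kernel barrier across the whole NDRange, so c[i+1] may be read before another group has written it. The usual fix is one kernel launch per time step, with the in-order queue providing the global synchronization, plus separate input and output buffers swapped on the host. A sketch of the reworked kernel:

// One time step per launch; reads only last step's buffer, writes the next.
__kernel void step(__global const float* a,
                   __global const float* c_in,
                   __global float* c_out) {
    unsigned int i = get_global_id(0);
    c_out[i] = (i < 9) ? c_in[i + 1] + a[i] : c_in[i];
}
// Host (pseudocode): for t in 0..1: enqueue step(a, c_in, c_out); swap the buffers.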

Does OpenCL allow concurrent writes to the same memory address?

Submitted by 强颜欢笑 on 2019-12-24 00:49:38
Question: Are two (or more) different threads allowed to write to the same memory location in global space in OpenCL? The write always changes a uchar from 0 to 1, so the outcome should be predictable, but I'm getting erratic results in my program, so I'm wondering whether the reason could be that some of the writes fail. Could it help to declare the buffer write-only and copy it to a read-only buffer afterwards? Answer 1: Did you try to use the cl_khr_global_int32_base_atomics extension and the atom_inc intrinsic …
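If the flag really only ever goes 0 to 1, the racing plain stores should all land eventually, but the ordering and visibility guarantees are weak; an atomic read-modify-write removes the doubt. A hypothetical kernel sketch using atomic_inc (core since OpenCL 1.1; atom_inc with the extension named above on 1.0):

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

// Count hits with an atomic instead of racing plain uchar stores.
__kernel void mark(__global volatile int* hit_count,
                   __global const uchar* data) {
    int gid = get_global_id(0);
    if (data[gid] == 1)
        atomic_inc(hit_count);   // safe under concurrent access
}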