opencl

Is there a way I can create an array of cl::sycl::pipe?

Submitted by 二次信任 on 2019-12-24 09:51:46
Question: I am using Xilinx's triSYCL GitHub implementation, https://github.com/triSYCL/triSYCL. I am trying to create a design with 100 cl::sycl::pipe objects, each with capacity = 6, and I am going to access each pipe through a separate thread in my SYCL code. Here is what I tried:

constexpr int T = 6;
constexpr int n_threads = 100;
cl::sycl::pipe<cl::sycl::pipe<float>> p { n_threads, cl::sycl::pipe<float> { T } };
for (int j = 0; j < n_threads; j++) {
  q.submit([&](cl::sycl::handler &cgh) {
    // Get write access …
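The nested cl::sycl::pipe<cl::sycl::pipe<float>> above makes the pipe's element type another pipe, which is probably not what is wanted; a pipe's type parameter should be the data flowing through it. A minimal sketch of one alternative, assuming triSYCL's cl::sycl::pipe<float> can be stored in a std::vector (the header path and constructor form follow the excerpt, not a verified triSYCL API):

#include <vector>
#include <CL/sycl.hpp>

constexpr int T = 6;           // capacity of each pipe
constexpr int n_threads = 100; // one pipe per thread

int main() {
  // 100 independent pipes, each with capacity T, instead of one pipe
  // whose element type is itself a pipe.
  std::vector<cl::sycl::pipe<float>> pipes;
  pipes.reserve(n_threads);
  for (int j = 0; j < n_threads; ++j)
    pipes.emplace_back(T);
  // Each thread j would then obtain its write/read accessors from
  // pipes[j] inside its own command-group submission.
}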

How to avoid constant memory copying in OpenCL

Submitted by Deadly on 2019-12-24 08:35:18
Question: I wrote a C++ application that simulates simple heat flow, using OpenCL for the computation. The OpenCL kernel takes a two-dimensional (n x n) array of temperature values and its size (n), and returns a new array with the temperatures after each cycle. Pseudocode:

int t_id = get_global_id(0);
if (t_id < n * n) {
    m_new[t_id / n][t_id % n] = average of its own and its neighbors' (top, bottom, left, right) temperatures
}

As you can see, every thread computes a single cell of the matrix. When the host …
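A common way to avoid the per-cycle copying the title asks about is double buffering on the device: allocate two buffers, swap which one is "old" and which is "new" each cycle, and read back only once at the end. A minimal host-side sketch against the C API; ctx, queue, kernel, n, cycles, and host are hypothetical names assumed to exist:

// Two device buffers that trade roles every cycle; no host round-trips.
cl_int err;
cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * n * sizeof(float), NULL, &err);
cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * n * sizeof(float), NULL, &err);
size_t global = (size_t)n * n;

for (int cycle = 0; cycle < cycles; ++cycle) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);   // m_old
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);   // m_new
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    cl_mem tmp = bufA; bufA = bufB; bufB = tmp;         // swap roles, no copy
}
// One read-back after all cycles instead of one per cycle.
clEnqueueReadBuffer(queue, bufA, CL_TRUE, 0, n * n * sizeof(float), host, 0, NULL, NULL);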

OpenCL: comparing the time required to add two arrays of integers on available platforms/devices

Submitted by 末鹿安然 on 2019-12-24 07:29:18
Question: I'm very new to the whole OpenCL world, so I'm following some beginner tutorials. I'm trying to combine this and this to compare the time required to add two arrays together on different devices. However, I'm getting confusing results. Considering that the code is too long, I made this GitHub Gist. On my Mac I have 1 platform with 3 devices. When I manually assign the j in

cl_command_queue command_queue = clCreateCommandQueue(context, device_id[j], 0, &ret);

to 0, it seems to run the …
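One pattern that makes such comparisons meaningful: give each device its own queue and block until the work has finished before stopping the timer, otherwise only the enqueue call is measured. A sketch of that loop; context, device_id, num_devices, add_kernel, and global are assumed from the surrounding code:

#include <stdio.h>
#include <time.h>

for (cl_uint j = 0; j < num_devices; ++j) {
    cl_int ret;
    cl_command_queue q = clCreateCommandQueue(context, device_id[j], 0, &ret);

    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    clEnqueueNDRangeKernel(q, add_kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(q);   // wait, so the host timer covers completed device work
    timespec_get(&t1, TIME_UTC);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) * 1e-6;
    printf("device %u: %.3f ms\n", j, ms);
    clReleaseCommandQueue(q);
}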

Read/Write OpenCL memory buffers on multiple GPU in a single context

Submitted by [亡魂溺海] on 2019-12-24 06:35:16
Question: Assume a system with two distinct GPUs from the same vendor, so they can be accessed from a single OpenCL platform. Given the following simplified OpenCL code:

float* someRawData;
cl_device_id gpu1 = clGetDeviceIDs(0,...);
cl_device_id gpu2 = clGetDeviceIDs(1,...);
cl_context ctx = clCreateContext(gpu1,gpu2,...);
cl_command_queue queue1 = clCreateCommandQueue(ctx,gpu1,...);
cl_command_queue queue2 = clCreateCommandQueue(ctx,gpu2,...);
cl_mem gpuMem = clCreateBuffer(ctx, CL_MEM_READ_WRITE, …
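With both devices in one context, the buffer object itself is shared; what needs care is the ordering between the two queues. A sketch of one way to write on gpu1's queue and consume on gpu2's queue, using an event so the runtime can synchronize (and migrate) the data; kernelOnGpu2, size, and global are hypothetical:

// Write through queue1, then make queue2's kernel wait on that write.
cl_event written;
clEnqueueWriteBuffer(queue1, gpuMem, CL_FALSE, 0, size, someRawData,
                     0, NULL, &written);
clSetKernelArg(kernelOnGpu2, 0, sizeof(cl_mem), &gpuMem);
clEnqueueNDRangeKernel(queue2, kernelOnGpu2, 1, NULL, &global, NULL,
                       1, &written, NULL);  // event dependency across queues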

Is there an OpenCL profiler for Mac OS X 10.8?

Submitted by 有些话、适合烂在心里 on 2019-12-24 03:56:08
Question: I am trying to find the bottleneck in my OpenCL kernel. Is it possible to profile OpenCL programs on Mac OS X? I found gDebugger at http://www.gremedy.com/, but it requires 10.5 or 10.6 to run. The AMD SDK supports only Linux and Windows. Is there a profiler for Mountain Lion? Answer 1: How detailed must your profiling information be? Is it okay to use the built-in internal profiler? OpenCL queues can be created with the CL_QUEUE_PROFILING_ENABLE flag. This way you can see, for each kernel you …
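A sketch of the built-in profiling the answer describes: create the queue with CL_QUEUE_PROFILING_ENABLE, attach an event to each enqueue, and read the start/end timestamps; ctx, dev, kernel, and global are assumed to exist:

#include <stdio.h>

cl_int err;
cl_command_queue q = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);
cl_event ev;
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong t0, t1;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
printf("kernel took %.3f ms\n", (t1 - t0) * 1e-6);  // timestamps are in ns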

Different ways to optimize Python code with GPU/PyOpenCL: extern function inside a GPU kernel

Submitted by 天涯浪子 on 2019-12-24 03:48:09
Question: I used the following command to profile my Python code:

python2.7 -m cProfile -o X2_non_flat_multiprocessing_dummy.prof X2_non_flat.py

Then I can visualize the overall distribution of time across the different greedy functions. [profiler visualization not reproduced here] As you can see, a lot of time is spent in Pobs_C and the interpolate routine, which correspond to the following code snippet:

def Pobs_C(z, zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T, R_T, DG_T_fid, DG_T, WGT_T, WT_T, WIAT_T, cl, P_dd_spec, RT500):
    cc …
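On the OpenCL side, the "extern function inside kernel" part usually comes down to this: a kernel cannot call back into host Python code, but a helper function compiled into the same OpenCL program can be called directly. A hypothetical OpenCL C sketch of moving a 1-D linear interpolation into the kernel (illustrative only, not the question's actual Pobs_C, whose body is truncated above):

// Helper defined in the same program source as the kernel.
float lerp1d(float x, __global const float* xs, __global const float* ys, int n) {
    // xs assumed ascending; clamp outside the table, interpolate inside.
    if (x <= xs[0])     return ys[0];
    if (x >= xs[n - 1]) return ys[n - 1];
    int i = 1;
    while (xs[i] < x) ++i;                      // first index with xs[i] >= x
    float t = (x - xs[i - 1]) / (xs[i] - xs[i - 1]);
    return ys[i - 1] + t * (ys[i] - ys[i - 1]);
}

__kernel void interp_all(__global const float* z, __global const float* xs,
                         __global const float* ys, __global float* out, int n) {
    int gid = get_global_id(0);
    out[gid] = lerp1d(z[gid], xs, ys, n);       // one interpolation per work-item
}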

“Compile Server Error.” while building OpenCL kernels

Submitted by 久未见 on 2019-12-24 03:44:07
Question: I am trying to compile OpenCL kernels on OS X. Everything is fine when there are just a few lines. However, once the code grows past about 1.5k lines, clGetProgramBuildInfo with the CL_PROGRAM_BUILD_LOG flag returns "Compile Server Error." every time. I googled but found nothing about it. Could anyone help me? Answer 1: You can learn the meaning of OpenCL error codes by searching in cl.h. In this case, -11 is just what you'd expect, CL_BUILD_PROGRAM_FAILURE. It's certainly curious that the build log is …
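When the compiler front end fails like this, the return code of clBuildProgram plus the full build log (when one exists) is still worth dumping; a minimal C sketch, assuming program and device are already set up:

#include <stdio.h>
#include <stdlib.h>

if (clBuildProgram(program, 1, &device, "", NULL, NULL) != CL_SUCCESS) {
    size_t len = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
    char *log = (char *)malloc(len);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, len, log, NULL);
    fprintf(stderr, "build log:\n%s\n", log);   // may itself just say "Compile Server Error."
    free(log);
}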

What is the optimum OpenCL 2 kernel to sum floats?

Submitted by 空扰寡人 on 2019-12-24 03:32:36
Question: C++17 introduced a number of new algorithms to support parallel execution; in particular, std::reduce is a parallel version of std::accumulate that permits non-deterministic results for non-associative operations such as floating-point addition. I want to implement a reduce algorithm using OpenCL 2. Intel has an example here which uses OpenCL 2 work-group kernel functions to implement a std::exclusive_scan OpenCL 2 kernel. Below is a kernel to sum floats, based on Intel's exclusive_scan …
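For reference, the shortest correct OpenCL 2.0 form (not necessarily the optimum the title asks for) uses the built-in work-group reduction: each work-group emits one partial sum, and a second pass or the host adds those up. Build with -cl-std=CL2.0:

// One partial sum per work-group via the OpenCL 2.0 built-in.
__kernel void sum_floats(__global const float* in,
                         __global float* partial_sums) {
    float v = in[get_global_id(0)];
    float group_sum = work_group_reduce_add(v);
    if (get_local_id(0) == 0)
        partial_sums[get_group_id(0)] = group_sum;
}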

OpenCL: no synchronization despite barrier

Submitted by 不问归期 on 2019-12-24 03:09:27
Question: I just started to use OpenCL via the PyOpenCL interface from Python. I tried to create a very simple "recurrent" program where the outcome of each loop iteration in every kernel depends on the output of another kernel from the previous cycle, but I am running into synchronization problems:

__kernel void part1(__global float* a, __global float* c) {
    unsigned int i = get_global_id(0);
    c[i] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (i < 9) {
        for (int t = 0; t < 2; t++) {
            c[i] = c[i+1] + a[i];
            barrier(CLK …
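The underlying issue: barrier() only synchronizes work-items within one work-group; there is no in-kernel barrier across the whole NDRange, so c[i+1] may be read before another group has written it. The usual fix is one kernel launch per time step, with the in-order queue providing the global synchronization, plus separate input and output buffers swapped on the host. A sketch of the reworked kernel:

// One time step per launch; reads only last step's buffer, writes the next.
__kernel void step(__global const float* a,
                   __global const float* c_in,
                   __global float* c_out) {
    unsigned int i = get_global_id(0);
    c_out[i] = (i < 9) ? c_in[i + 1] + a[i] : c_in[i];
}
// Host (pseudocode): for t in 0..1: enqueue step(a, c_in, c_out); swap the buffers.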

Does OpenCL allow concurrent writes to the same memory address?

Submitted by 强颜欢笑 on 2019-12-24 00:49:38
Question: Are two (or more) different threads allowed to write to the same memory location in global space in OpenCL? The write always changes a uchar from 0 to 1, so the outcome should be predictable, but I'm getting erratic results in my program, so I'm wondering whether the reason could be that some of the writes fail. Could it help to declare the buffer write-only and copy it to a read-only buffer afterwards? Answer 1: Did you try to use the cl_khr_global_int32_base_atomics extension and the atom_inc intrinsic …
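If the flag really only ever goes 0 to 1, the racing plain stores should all land eventually, but the ordering and visibility guarantees are weak; an atomic read-modify-write removes the doubt. A hypothetical kernel sketch using atomic_inc (core since OpenCL 1.1; atom_inc with the extension named above on 1.0):

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

// Count hits with an atomic instead of racing plain uchar stores.
__kernel void mark(__global volatile int* hit_count,
                   __global const uchar* data) {
    int gid = get_global_id(0);
    if (data[gid] == 1)
        atomic_inc(hit_count);   // safe under concurrent access
}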