OpenCL

What is the best way to implement a small lookup table in an OpenCL kernel?

时间秒杀一切 submitted on 2019-12-21 12:37:43
Question: In my kernel it is necessary to make a large number of random accesses to a small lookup table (only 8 32-bit integers). Each kernel has a unique lookup table. Below is a simplified version of the kernel to illustrate how the lookup table is used.

```c
__kernel void some_kernel(__global uint* global_table,
                          __global uint* X,
                          __global uint* Y)
{
    size_t gsi = get_global_size(0);
    size_t gid = get_global_id(0);
    __private uint LUT[8];
    // 8 words of global_table are copied to LUT
    // Y is assigned a ...
```
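
The excerpt is cut off, but a complete, minimal sketch of the pattern it describes might look as follows; the final line assigning Y (indexing the LUT with the low three bits of X) is an assumption, since the original breaks off before that point:

```c
// Copy a per-kernel 8-entry lookup table from global memory into
// private memory, then use it for random accesses.
__kernel void some_kernel(__global const uint* global_table,
                          __global const uint* X,
                          __global uint* Y)
{
    size_t gid = get_global_id(0);

    __private uint LUT[8];
    for (int i = 0; i < 8; ++i)     // copy 8 words of global_table
        LUT[i] = global_table[i];

    // Hypothetical use: index the LUT with the low 3 bits of X.
    Y[gid] = LUT[X[gid] & 7u];
}
```

One caveat: because the indexing is dynamic, some compilers spill a private array like this to off-chip memory rather than keeping it in registers, so a __local table (or packing the eight values into vector registers) can be worth benchmarking as alternatives.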

A nice starter kit for OpenCL? [closed]

雨燕双飞 submitted on 2019-12-21 09:38:00
Question: I've got some experience with OpenGL and its programmable pipeline. I'd like to give OpenCL a try, though. Could somebody propose a nice integrated kit for working with OpenCL? I know only of QuartzComposer, which looks nice, but it's Mac-only. Does anyone know if it supports hand-editing of OpenCL kernels, or is it …

Best approach to a FIFO implementation in an OpenCL kernel

青春壹個敷衍的年華 submitted on 2019-12-21 05:46:15
Question: Goal: implement the diagram shown below in OpenCL. The main thing needed from the OpenCL kernel is to multiply the coefficient array and the temp array, and then accumulate all those values into one at the end. (That is probably the most time-intensive operation; parallelism would be really helpful here.) I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well). Description of the picture: one at a time, the values …
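
A minimal sketch of the multiply-and-accumulate step described above, assuming hypothetical buffer names coeff, temp, and result, and a work-group size that is a power of two; each work-group writes one partial sum, which the host (or a second pass) adds together:

```c
// Multiply coeff[i] * temp[i] per work-item, then reduce the products
// to a single sum with a local-memory tree reduction.
__kernel void mac_reduce(__global const float* coeff,
                         __global const float* temp,
                         __global float* result,
                         __local float* scratch)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    scratch[lid] = coeff[gid] * temp[gid];   // the multiply step
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: halve the number of active work-items each pass.
    for (size_t offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        result[get_group_id(0)] = scratch[0]; // one partial sum per group
}
```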

Create a local array dynamically inside an OpenCL kernel

好久不见. submitted on 2019-12-21 04:59:11
Question: I have an OpenCL kernel that needs to process an array as multiple arrays, where each sub-array sum is saved in a local cache array. For example, imagine the following array: [[1, 2, 3, 4], [10, 30, 1, 23]]. Each work-group gets an array (in the example we have 2 work-groups); each work-item processes two array indexes (for example, multiplying the value at an index by the local_id), where the work-item result is saved in a work-group shared array.

```c
__kernel void test(__global int **values,
                   __global int *result, ...
```
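
OpenCL C cannot size a __local array from inside the kernel at run time, but the host can allocate __local memory dynamically by passing a size together with a NULL pointer to clSetKernelArg. A minimal sketch of that approach, using a hypothetical kernel name sum_groups and a flattened 1-D values buffer (standard OpenCL C does not allow pointer-to-pointer kernel arguments like __global int **):

```c
// Kernel side: the __local pointer argument is sized by the host.
__kernel void sum_groups(__global const int* values,
                         __global int* result,
                         __local int* cache)        // dynamically sized
{
    size_t lid = get_local_id(0);

    cache[lid] = values[get_global_id(0)] * (int)lid;
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid == 0) {                 // one work-item sums the group's cache
        int sum = 0;
        for (size_t i = 0; i < get_local_size(0); ++i)
            sum += cache[i];
        result[get_group_id(0)] = sum;
    }
}

/* Host side: a NULL data pointer with a nonzero size allocates
   __local memory of that size for argument 2:

   clSetKernelArg(kernel, 2, local_size * sizeof(cl_int), NULL);   */
```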

Python pyopencl DLL load failed even with latest drivers

旧街凉风 submitted on 2019-12-21 04:14:32
Question: I've installed the latest CUDA and driver for my GPU. I'm using Python 2.7.10 on Win7 64-bit. I tried installing pyopencl from: (a) the unofficial Windows binaries at http://www.lfd.uci.edu/~gohlke/pythonlibs/#pyopencl, and (b) by compiling my own after getting the sources from https://pypi.python.org/pypi/pyopencl. The installation was successful in both cases, but I get the same error message once I try to import it:

```
>>> import pyopencl
Traceback (most recent call last):
  File "<stdin>", line 1, in …
```

Performance: boost.compute vs. the OpenCL C++ wrapper

ⅰ亾dé卋堺 submitted on 2019-12-21 04:13:41
Question: The following code adds two vectors using boost.compute and the OpenCL C++ wrapper respectively. The result shows boost.compute to be almost 20 times slower than the OpenCL C++ wrapper. I wonder whether I am misusing boost.compute or it is indeed slow. Platform: Win7, VS2013, Boost 1.55, Boost.Compute 0.2, ATI Radeon HD 4600. Code using the C++ wrapper:

```cpp
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <boost/timer/timer.hpp>
#include <boost/smart_ptr/scoped_array.hpp>
#include <fstream>
#include …
```

OpenCL for Python

瘦欲@ submitted on 2019-12-21 03:32:53
Question: I'm looking for a good OpenCL wrapper/library for Python, with good documentation. I tried searching for some... but couldn't find one good enough.

Answer 1: The most popular and best-documented option seems to be PyOpenCL. It claims to be a complete wrapper for OpenCL, and the documentation looks good.

Answer 2: Both CLyther and PyOpenCL look nicely documented to me.

Answer 3: pycl is a ctypes binding to OpenCL (hosted on Bitbucket). Its primary goal is simple: wrap OpenCL in such a way that as many Python …

How many threads (or work-items) can run at the same time?

末鹿安然 submitted on 2019-12-21 03:21:08
Question: I'm new to GPGPU programming and I'm working with NVIDIA's implementation of OpenCL. My question is how to compute the limit of a GPU device (in number of threads). From what I understood, there are a number of work-groups (the equivalent of blocks in CUDA) that contain a number of work-items (~ a CUDA thread). How do I get the number of work-groups present on my card (and that can run at the same time), and the number of work-items present in one work-group? To what does CL_DEVICE_MAX_COMPUTE_UNITS …
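
For reference, the two device limits the question is reaching for can be read with clGetDeviceInfo; a minimal host-side sketch, assuming a cl_device_id named device has already been obtained:

```c
#include <stdio.h>
#include <CL/cl.h>

/* Query the limits discussed above for an already-obtained device. */
void print_device_limits(cl_device_id device)
{
    cl_uint compute_units;   /* number of compute units, not work-groups */
    size_t  max_wg_size;     /* maximum work-items per work-group */

    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg_size), &max_wg_size, NULL);

    printf("Compute units:       %u\n", compute_units);
    printf("Max work-group size: %zu\n", max_wg_size);
}
```

Note that CL_DEVICE_MAX_COMPUTE_UNITS corresponds to the number of compute units (roughly, CUDA multiprocessors), not to a work-group count: the number of work-groups is chosen at launch time, while CL_DEVICE_MAX_WORK_GROUP_SIZE caps the work-items in each group.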
