opencl

Reduction with OpenMP: linear merging or log(number of threads) merging

别来无恙 submitted on 2019-12-02 04:04:59
I have a general question about reductions with OpenMP that has bothered me for a while. It concerns merging the partial sums in a reduction, which can be done either linearly or in log(number of threads) steps. Suppose I want to reduce some function double foo(int i). With OpenMP I could do it like this:

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += foo(i);
    }

However, I claim that the following code will be just as efficient:

    double sum = 0.0;
    #pragma omp parallel
    {
        double sum_private = 0.0;
        #pragma omp for nowait
    …

Passing two options as arguments in OpenCL with Fortran (CLFORTRAN)

主宰稳场 submitted on 2019-12-02 03:47:39
When my host program is in C, I can pass two flags combined as one argument to an OpenCL function. For example, I can pass two flags to clCreateBuffer like this:

    clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                   sizeof(main_data), main_data, &err);

However, when I try to do the same in a host program written in Fortran:

    main_data = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, &
                               sizeof(main_data), C_NULL_PTR, err)

I get an error:

    & |CL_MEM_COPY_HOST_PTR, size_in_bytes, C_NULL_PTR, ierr)
      1
    Error: Syntax error in argument list at (1)

I have successfully…

Enable/disable Optimus/Enduro in cross platform manner

北城余情 submitted on 2019-12-02 03:44:42
To save power, recent graphics architectures commonly switch dynamically between a discrete high-performance GPU and an integrated lower-performance GPU, where the high-performance GPU is enabled only when extra performance is needed. This technology is branded NVIDIA Optimus and AMD Enduro by the two main GPU vendors. However, due to the non-standardized way in which these technologies work, managing them from a developer's perspective can be a nightmare. For example, in this PDF from NVIDIA on the subject, they explain the many intricacies, limitations and…

OpenCL kernel error on Mac OSx

为君一笑 submitted on 2019-12-02 02:24:05
I wrote some OpenCL code which works fine on Linux, but fails with errors on Mac OS X. Can someone help me identify why these errors occur? The kernel code is shown after the error. My kernel uses double, so I have the corresponding pragma at the top, but I don't know why the error refers to the float data type:

    inline float8 __OVERLOAD__ _name(float8 x) { return _default_name(x); } \
                                    ^
    /System/Library/Frameworks/OpenCL.framework/Versions/A/lib/clang/3.2/include/cl_kernel.h:4606:30: note: candidate function
    __CLFN_FD_1FD_FAST_RELAX(__fast_relax_log, native_log, __cl_log);
                             ^
    /System…

CL_DEVICE_NOT_AVAILABLE using Intel Core 2 Duo E8500 CPU

送分小仙女□ submitted on 2019-12-02 01:05:36
I get the error CL_DEVICE_NOT_AVAILABLE when running this sample code. However, unlike in that question, my CPU, the Intel Core 2 Duo E8500, appears to be supported. I've made sure to link against the Intel version of the OpenCL libraries, since I also have the Nvidia libraries installed. Why is this error occurring?

CL_DEVICE_NOT_AVAILABLE has nothing to do with the SDK. It's due to the OpenCL device driver, which is part of the video card device driver. It's common to confuse the SDK and the OpenCL device driver: you develop the host code with the SDK, but the kernel is compiled and…

Error CL_DEVICE_NOT_AVAILABLE when calling clCreateContext (Intel Core2Duo, Intel OCL SDK 3.0 beta)

时光怂恿深爱的人放手 submitted on 2019-12-02 00:42:25
Question: I'm trying to get started with OpenCL (Intel opencl-1.2-3.0.56860). I managed to install the OpenCL SDK from Intel under Ubuntu 12.05 (using "alien" to convert the rpm packages to *.deb packages). Now I'm trying to get my first simple OpenCL program running. To run the program I need to set LD_LIBRARY_PATH:

    export LD_LIBRARY_PATH=/opt/intel/opencl/lib64/

My problem is that I always get the error CL_DEVICE_NOT_AVAILABLE when calling clCreateContext(...). Here is my source code:…

Is the access performance of __constant memory the same as __global memory in OpenCL?

心已入冬 submitted on 2019-12-01 23:05:43
As far as I know, constant memory in CUDA is a special memory space, and it is faster than global memory. But in the OpenCL spec I find the following words:

    The __constant or constant address space name is used to describe variables allocated in
    global memory and which are accessed inside a kernel(s) as read-only variables.

So __constant memory is carved out of __global memory. Does that mean it has the same access performance as __global memory?

It depends on the hardware and software architecture of the OpenCL platform you are using. For example, one can envision an architecture with read-only…

What is/are the fastest memset() alternatives for OpenCL?

跟風遠走 submitted on 2019-12-01 22:29:16
Question: I'm using OpenCL, and I need to memset() some array in global device memory. CUDA has a memset()-like API function, but OpenCL does not. I read this, where I found two possible alternatives:

- using memset() on the host with some scratch buffer, then clEnqueueWriteBuffer() to copy that to the buffer on the device;
- enqueueing the following kernel:

    __kernel void memset_uint4(__global uint4* mem, __private uint4 val) {
        mem[get_global_id(0)] = val;
    }

Which is better? Or rather, under which…

OpenCL select/delete points from large array

此生再无相见时 submitted on 2019-12-01 21:47:21
I have an array of 2M+ points (planned to increase to 20M in due course) on which I am running calculations via OpenCL. I'd like to delete any points that fall within a random triangle geometry. How can I do this within an OpenCL kernel?

I can already:

- identify the points that fall outside the triangle (a simple point-in-polygon algorithm in the kernel);
- pass their coordinates to a global output array.

But: an OpenCL global output array cannot be of variable size, so I initialise it to match the input array of points in size. As a result, 0,0 points occur in the final output when a…
