opencl

How to use _mm256_log_ps by leveraging Intel OpenCL SVML?

谁都会走 submitted on 2019-12-11 15:17:01
Question: I found that _mm256_log_ps can't be used with GCC 7. The most common suggestions on Stack Overflow are to use ICC or to leverage the OpenCL SDK. After downloading the SDK and extracting the RPM file, there are three .so files: __ocl_svml_l9.so, __ocl_svml_e9.so, __ocl_svml_h8.so. Can someone teach me how to call _mm256_log_ps with these .so files? Thank you.

Answer 1: You can use the log function from the Eigen library: #include <Eigen/Core> void foo(float* data, int size) { Eigen::Map<Eigen::ArrayXf> arr(data, size); …
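A minimal sketch of the Eigen route the answer points at (assumes Eigen 3.3+ compiled with -mavx so the array expression actually vectorizes):

    #include <Eigen/Core>

    // Take the natural log of every element in place; Eigen's array
    // expressions dispatch to AVX when the compiler flags allow it.
    void log_inplace(float* data, int size) {
        Eigen::Map<Eigen::ArrayXf> arr(data, size);
        arr = arr.log();
    }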

OpenCL Cloo Null referenece exception when compiling kernel

淺唱寂寞╮ submitted on 2019-12-11 14:36:20
Question: I'm trying to compile a program to run some OpenCL code using Cloo. static void Main(string[] arg) { CLCalc.InitCL(); string src = File.ReadAllText("kernels.cl"); CLCalc.Program.Compile(src); } The program was previously working fine, but now I get a null reference exception. At first I thought there was a bug in my OpenCL code, but I replaced it with a very simple kernel and it still would not compile, i.e.: static void Main(string[] arg) { CLCalc.InitCL(); //string src = File…

Understanding work-items and work-groups

若如初见. submitted on 2019-12-11 13:58:37
Question: Based on my previous question, I'm still trying to copy an image (no practical reason, just to start with an easy one). The image contains 200 * 300 == 60000 pixels. The maximum number of work-items is 4100 according to CL_DEVICE_MAX_WORK_GROUP_SIZE. kernel1: std::string kernelCode = "void kernel copy(global const int* image, global int* result)" "{" "result[get_local_id(0) + get_group_id(0) * get_local_size(0)] = image[get_local_id(0) + get_group_id(0) * get_local_size(0)];" "}"; queue: for…
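For reference, a sketch of the same copy written with get_global_id, which is exactly the local-id/group-id expression the question spells out by hand (behaviour is unchanged; this is only a restatement, not a fix for the question):

    kernel void copy(global const int* image, global int* result)
    {
        // get_global_id(0) == get_local_id(0) + get_group_id(0) * get_local_size(0)
        size_t i = get_global_id(0);
        result[i] = image[i];
    }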

OpenCL & Xcode - Incorrect kernel header being generated for custom data type argument

谁说我不能喝 submitted on 2019-12-11 12:33:02
Question: I'm parallelising an LBM using OpenCL and have come across a problem with how the kernel header files are generated for a custom data type used as a kernel argument. I define the data type within the kernel file (rebound.cl) as required (typedef struct {...} t_speed;), and the data type t_speed ends up in the generated header file in a way that is obviously syntactically incorrect, so the build subsequently fails. Whilst this is more of an annoyance than a major problem, fixing it would save a…
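One common workaround, sketched here as a general pattern rather than anything specific to Xcode's header generator: move the typedef into a small shared header that both the host code and rebound.cl include, so the type is declared before any generated prototype refers to it. The file name and struct contents below are hypothetical (a typical D2Q9 LBM layout):

    /* t_speed.h (hypothetical shared header) */
    #ifndef T_SPEED_H
    #define T_SPEED_H
    #define NSPEEDS 9                     /* assumed D2Q9 lattice */
    typedef struct { float speeds[NSPEEDS]; } t_speed;
    #endif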

How to check if a bit value equals 1?

跟風遠走 submitted on 2019-12-11 12:18:52
Question: I have a running OpenCL kernel. At some point I want to check whether the bit at a selected position in a variable equals one or not. For example, I found that in C# we can convert a uint to a string containing its bits using code like this: // Here we will store our bit string bit; // Unsigned integer that will be converted uint hex = 0xfffffffe; // Perform conversion and store our string containing bits string str = uint2bits(hex); // Print all bits "11111111111111111111111111111110" Console.WriteLine(str); /…
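In OpenCL C there is no need to go through a string at all; a shift-and-mask test reads the bit directly. A minimal sketch, with the position counted from bit 0 (the least significant bit):

    // Returns 1 if the bit at 'pos' is set, 0 otherwise.
    int bit_is_set(uint value, uint pos)
    {
        return (value >> pos) & 1u;
    }
    // Example: bit_is_set(0xfffffffeu, 0) == 0, bit_is_set(0xfffffffeu, 1) == 1.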

PyOpenCL returns errors the first run, then only 'invalid program' errors; examples also not working

家住魔仙堡 submitted on 2019-12-11 11:57:23
Question: I am trying to run an OpenCL kernel on the GPU using the PyOpenCL bindings. While loading the kernel into my program, I ran it once and got an error; I ran it again without changing the code and got a different, 'invalid program' error. This keeps happening with my own programs using PyOpenCL and also with the example programs. I am able to use OpenCL through the C++ bindings, on both the CPU and GPU, with no problems, so I think this is a problem specific to the PyOpenCL…

OpenCL efficient way to group a lower triangular matrix

冷暖自知 submitted on 2019-12-11 11:43:51
Question: I'm sure someone has come across this problem before. Basically I have a 2D optimisation grid of size NxM, with the constraint that n_i <= m_i, i.e. I only want to calculate the pairs in the lower triangular section of the matrix. At the moment I naively implement all NxM combinations as N local groups of M work-items (and then use localGroupID and workGroupID to identify the pair), and return -inf when the constraint fails, to save computation. But is there a better way to set up…
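One standard way to avoid launching the upper-triangular half at all (a sketch, assuming for simplicity a square N x N grid, not taken from an answer): enqueue only N*(N+1)/2 work-items and invert the triangular-number formula to recover (row, col) from the flat global ID. The closed-form sqrt is only exact while 8k+1 fits in the float mantissa, so very large N needs a double or an integer search instead:

    kernel void lower_tri(global float* out)
    {
        size_t k   = get_global_id(0);                        // 0 .. N*(N+1)/2 - 1
        size_t row = (size_t)((sqrt(8.0f * (float)k + 1.0f) - 1.0f) * 0.5f);
        size_t col = k - row * (row + 1) / 2;                 // col <= row by construction
        // ... evaluate the pair (row, col) and write to out here ...
    }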

OpenCL and CUDA registers usage optimization

大憨熊 submitted on 2019-12-11 11:38:07
Question: I'm currently writing an OpenCL kernel (but I suppose it will be the same in CUDA), and I'm trying to optimize it for NVidia GPUs. I currently use 63 registers in this kernel; it is very big and so it uses all the GPU registers. I'm looking for some way to: 1) see which variables are in registers and which end up in global memory (because when there are not enough registers, the compiler seems to save the variables in global memory); 2) is there a way to specify which variable is more…
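For point 1, one way to see the register count on NVIDIA's OpenCL implementation is to pass the vendor build option -cl-nv-verbose and read the build log, which then reports per-kernel register usage much like ptxas -v does under CUDA; the flag is NVIDIA-specific and driver-dependent, so treat it as an assumption. A host-side sketch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    /* Build an already-created program with the NVIDIA verbose flag and
       print the build log, which contains the register-usage summary. */
    void print_build_log(cl_program program, cl_device_id device)
    {
        clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        printf("%s\n", log);
        free(log);
    }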

OpenCL: parallel reduction without local memory

不羁的心 submitted on 2019-12-11 11:12:47
Question: Most parallel-reduction algorithms (Nvidia, AMD, Intel, and so on) use shared (local) memory. But what if a device doesn't have shared (local) memory, how can I do it? If I use the same algorithms but store the temporary values in global memory, will it work fine? Answer 1: If I think about it, my comment already was the complete answer. Yes, you can use global memory as a replacement for local memory, but: you have to allocate enough global memory for all workgroups and assign the workgroups…
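A minimal sketch of a global-memory-only reduction (assuming n is a power of two and the input buffer may be overwritten): the partial sums live in the global buffer itself, and the host re-enqueues the kernel with a halving stride, since kernel-launch boundaries are the only global synchronization available:

    kernel void reduce_step(global float* data, const uint stride)
    {
        size_t i = get_global_id(0);      /* host enqueues exactly 'stride' work-items */
        data[i] += data[i + stride];
    }
    /* Host loop (pseudocode):
       for (stride = n / 2; stride >= 1; stride /= 2)
           enqueue reduce_step over 'stride' work-items;
       the final sum is left in data[0]. */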

Segmentation fault (core dumped) in OpenCL

删除回忆录丶 submitted on 2019-12-11 10:19:02
Question: I am very new to OpenCL, but I have been doing parallel programming for more than a year now. I was writing my first OpenCL program (matrix multiplication). I wrote the following code: //#include<stdio.h> #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <string.h> #include <SDKCommon.hpp> #include <SDKApplication.hpp> #include <SDKCommandArgs.hpp> #include <SDKFile.hpp> #include <CL/cl.h> #define MAX_SOURCE_SIZE (0x100000) #define MATSIZE 16 void initmat(float *Aa,float *Bb,float…
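Since the excerpt is cut off before the crash site, here is only a general debugging sketch, not the asker's actual fix: segmentation faults in first OpenCL programs are very often an unchecked NULL handle or error code that gets used later, so checking every cl* return value usually pinpoints the failing call. The 'device' variable below is assumed to have been obtained from clGetDeviceIDs earlier:

    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    if (err != CL_SUCCESS || ctx == NULL) {
        fprintf(stderr, "clCreateContext failed with error %d\n", err);
        exit(EXIT_FAILURE);
    }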