opencl

How to use _mm256_log_ps by leveraging Intel OpenCL SVML?

谁都会走 submitted on 2019-12-11 15:17:01
Question: I found that _mm256_log_ps can't be used with GCC 7. The most common suggestions on Stack Overflow are to use ICC or to leverage the OpenCL SDK. After downloading the SDK and extracting the RPM file, there are three .so files: __ocl_svml_l9.so, __ocl_svml_e9.so, __ocl_svml_h8.so. Can someone teach me how to call _mm256_log_ps with these .so files? Thank you.

Answer 1: You can use the log function from the Eigen library: #include <Eigen/Core> void foo(float* data, int size) { Eigen::Map<Eigen::ArrayXf> arr(data, size); …
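A minimal sketch of the Eigen route the answer points at (assumes Eigen 3.3+ compiled with -mavx so the array expression actually vectorizes):

    #include <Eigen/Core>

    // Take the natural log of every element in place; Eigen's array
    // expressions dispatch to AVX when the compiler flags allow it.
    void log_inplace(float* data, int size) {
        Eigen::Map<Eigen::ArrayXf> arr(data, size);
        arr = arr.log();
    }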

OpenCL Cloo Null referenece exception when compiling kernel

淺唱寂寞╮ submitted on 2019-12-11 14:36:20
Question: I'm trying to compile a program to run some OpenCL code using Cloo. static void Main(string[] arg) { CLCalc.InitCL(); string src = File.ReadAllText("kernels.cl"); CLCalc.Program.Compile(src); } The program was previously working fine, but now I get a null reference exception. At first I thought there was a bug in my OpenCL code, but I replaced it with a very simple kernel and it still would not compile, i.e.: static void Main(string[] arg) { CLCalc.InitCL(); //string src = File…

Understanding work-items and work-groups

若如初见. submitted on 2019-12-11 13:58:37
Question: Based on my previous question, I'm still trying to copy an image (no practical reason, just to start with an easy one). The image contains 200 * 300 == 60000 pixels. The maximum number of work-items is 4100 according to CL_DEVICE_MAX_WORK_GROUP_SIZE. kernel1: std::string kernelCode = "void kernel copy(global const int* image, global int* result)" "{" "result[get_local_id(0) + get_group_id(0) * get_local_size(0)] = image[get_local_id(0) + get_group_id(0) * get_local_size(0)];" "}"; queue: for…
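For reference, a sketch of the same copy written with get_global_id, which is exactly the local-id/group-id expression the question spells out by hand (behaviour is unchanged; this is only a restatement, not a fix for the question):

    kernel void copy(global const int* image, global int* result)
    {
        // get_global_id(0) == get_local_id(0) + get_group_id(0) * get_local_size(0)
        size_t i = get_global_id(0);
        result[i] = image[i];
    }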

OpenCL & Xcode - Incorrect kernel header being generated for custom data type argument

谁说我不能喝 submitted on 2019-12-11 12:33:02
Question: I'm parallelising an LBM using OpenCL and have come across a problem with how the kernel header files are generated for a custom data type used as a kernel argument. I define the data type within the kernel file (rebound.cl) as required (typedef struct {...} t_speed;), and the data type t_speed ends up in the generated header file in a way that is obviously syntactically incorrect, so the build subsequently fails. Whilst this is more of an annoyance than a major problem, fixing it would save a…
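One common workaround, sketched here as a general pattern rather than anything specific to Xcode's header generator: move the typedef into a small shared header that both the host code and rebound.cl include, so the type is declared before any generated prototype refers to it. The file name and struct contents below are hypothetical (a typical D2Q9 LBM layout):

    /* t_speed.h (hypothetical shared header) */
    #ifndef T_SPEED_H
    #define T_SPEED_H
    #define NSPEEDS 9                     /* assumed D2Q9 lattice */
    typedef struct { float speeds[NSPEEDS]; } t_speed;
    #endif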

How to check if a bit value equals 1?

跟風遠走 submitted on 2019-12-11 12:18:52
Question: I have a running OpenCL kernel. At some point I want to check whether the bit at a selected position in a variable equals one or not. For example, I found that in C# we can convert a uint to a string containing its bits using code like this: // Here we will store our bit string bit; // Unsigned integer that will be converted uint hex = 0xfffffffe; // Perform conversion and store our string containing bits string str = uint2bits(hex); // Print all bits "11111111111111111111111111111110" Console.WriteLine(str); /…
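In OpenCL C there is no need to go through a string at all; a shift-and-mask test reads the bit directly. A minimal sketch, with the position counted from bit 0 (the least significant bit):

    // Returns 1 if the bit at 'pos' is set, 0 otherwise.
    int bit_is_set(uint value, uint pos)
    {
        return (value >> pos) & 1u;
    }
    // Example: bit_is_set(0xfffffffeu, 0) == 0, bit_is_set(0xfffffffeu, 1) == 1.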

PyOpenCL returns errors the first run, then only 'invalid program' errors; examples also not working

家住魔仙堡 submitted on 2019-12-11 11:57:23
Question: I am trying to run an OpenCL kernel on the GPU using the PyOpenCL bindings. While loading the kernel into my program, I ran it once and got an error; I ran it again without changing the code and got a different, 'invalid program' error. This keeps happening with my own programs using PyOpenCL and also with the example programs. I am able to use OpenCL through the C++ bindings, on both the CPU and GPU, with no problems, so I think this is a problem specific to the PyOpenCL…

OpenCL efficient way to group a lower triangular matrix

冷暖自知 submitted on 2019-12-11 11:43:51
Question: I'm sure someone has come across this problem before. Basically I have a 2D optimisation grid of size NxM, with the constraint that n_i <= m_i, i.e. I only want to calculate the pairs in the lower triangular section of the matrix. At the moment I naively implement all NxM combinations as N local groups of M work-items (and then use localGroupID and workGroupID to identify the pair), and return -inf when the constraint fails, to save computation. But is there a better way to set up…
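One standard way to avoid launching the upper-triangular half at all (a sketch, assuming for simplicity a square N x N grid, not taken from an answer): enqueue only N*(N+1)/2 work-items and invert the triangular-number formula to recover (row, col) from the flat global ID. The closed-form sqrt is only exact while 8k+1 fits in the float mantissa, so very large N needs a double or an integer search instead:

    kernel void lower_tri(global float* out)
    {
        size_t k   = get_global_id(0);                        // 0 .. N*(N+1)/2 - 1
        size_t row = (size_t)((sqrt(8.0f * (float)k + 1.0f) - 1.0f) * 0.5f);
        size_t col = k - row * (row + 1) / 2;                 // col <= row by construction
        // ... evaluate the pair (row, col) and write to out here ...
    }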

OpenCL and CUDA registers usage optimization

大憨熊 submitted on 2019-12-11 11:38:07
Question: I'm currently writing an OpenCL kernel (but I suppose it will be the same in CUDA), and I'm trying to optimize it for NVidia GPUs. I currently use 63 registers in this kernel; it is very big and so it uses all the GPU registers. I'm looking for some way to: 1) see which variables are in registers and which end up in global memory (because when there are not enough registers, the compiler seems to save the variables in global memory); 2) is there a way to specify which variable is more…
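For point 1, one way to see the register count on NVIDIA's OpenCL implementation is to pass the vendor build option -cl-nv-verbose and read the build log, which then reports per-kernel register usage much like ptxas -v does under CUDA; the flag is NVIDIA-specific and driver-dependent, so treat it as an assumption. A host-side sketch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    /* Build an already-created program with the NVIDIA verbose flag and
       print the build log, which contains the register-usage summary. */
    void print_build_log(cl_program program, cl_device_id device)
    {
        clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        printf("%s\n", log);
        free(log);
    }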

OpenCL: parallel reduction without local memory

不羁的心 submitted on 2019-12-11 11:12:47
Question: Most parallel-reduction algorithms (Nvidia, AMD, Intel, and so on) use shared (local) memory. But what if a device doesn't have shared (local) memory, how can I do it? If I use the same algorithms but store the temporary values in global memory, will it work fine? Answer 1: If I think about it, my comment already was the complete answer. Yes, you can use global memory as a replacement for local memory, but: you have to allocate enough global memory for all workgroups and assign the workgroups…
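A minimal sketch of a global-memory-only reduction (assuming n is a power of two and the input buffer may be overwritten): the partial sums live in the global buffer itself, and the host re-enqueues the kernel with a halving stride, since kernel-launch boundaries are the only global synchronization available:

    kernel void reduce_step(global float* data, const uint stride)
    {
        size_t i = get_global_id(0);      /* host enqueues exactly 'stride' work-items */
        data[i] += data[i + stride];
    }
    /* Host loop (pseudocode):
       for (stride = n / 2; stride >= 1; stride /= 2)
           enqueue reduce_step over 'stride' work-items;
       the final sum is left in data[0]. */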

Segmentation fault (core dumped) in OpenCL

删除回忆录丶 submitted on 2019-12-11 10:19:02
Question: I am very new to OpenCL, but I have been doing parallel programming for more than a year now. I was writing my first OpenCL program (matrix multiplication). I wrote the following code: //#include<stdio.h> #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <string.h> #include <SDKCommon.hpp> #include <SDKApplication.hpp> #include <SDKCommandArgs.hpp> #include <SDKFile.hpp> #include <CL/cl.h> #define MAX_SOURCE_SIZE (0x100000) #define MATSIZE 16 void initmat(float *Aa,float *Bb,float…
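Since the excerpt is cut off before the crash site, here is only a general debugging sketch, not the asker's actual fix: segmentation faults in first OpenCL programs are very often an unchecked NULL handle or error code that gets used later, so checking every cl* return value usually pinpoints the failing call. The 'device' variable below is assumed to have been obtained from clGetDeviceIDs earlier:

    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    if (err != CL_SUCCESS || ctx == NULL) {
        fprintf(stderr, "clCreateContext failed with error %d\n", err);
        exit(EXIT_FAILURE);
    }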