gpgpu

CUDA: Does passing arguments to a kernel slow the kernel launch much?

寵の児 submitted on 2019-12-04 07:05:56
CUDA beginner here. In my code I am currently launching kernels many times in a loop in the host code (because I need synchronization between blocks), so I wondered whether I could optimize the kernel launch. My kernel launches look something like this: MyKernel<<<blocks,threadsperblock>>>(double_ptr, double_ptr, int N, double x); So to launch a kernel, some signal obviously has to go from the CPU to the GPU, but I'm wondering whether passing the arguments makes this process noticeably slower. The arguments to the kernel are the same every single time, so perhaps I could save time by passing them only once.
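For illustration, a minimal sketch of the idea in the question, with hypothetical names: since the arguments never change, the invariant scalars could be copied once into __constant__ memory before the loop, shrinking the per-launch argument list. Whether that saves measurable time is exactly what the question asks.

    // Sketch under the assumption above; the kernel body is a placeholder.
    __constant__ double c_x;   // set once before the launch loop
    __constant__ int    c_N;

    __global__ void MyKernel(double *a, const double *b)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < c_N)
            a[idx] += b[idx] * c_x;   // placeholder work
    }

    // Host side, once before the loop:
    // cudaMemcpyToSymbol(c_x, &x, sizeof(double));
    // cudaMemcpyToSymbol(c_N, &N, sizeof(int));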

Is there an efficient way to optimize my serialized code?

我只是一个虾纸丫 submitted on 2019-12-04 06:22:24
Question: This question lacks details, so I decided to create another question instead of editing this one. The new question is here: Can I parallelize my code, or is it not worth it? I have a program running in CUDA where one piece of the code runs within a loop (serialized, as you can see below). This piece of code searches within an array that contains addresses and/or NULL pointers. All the threads execute the code below. while (i < n) { if (array[i] != NULL) { return array[i]; } i++; }
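For readability, the same per-thread search as a self-contained device function, assuming array is a device array of n pointers (names are illustrative):

    __device__ void *find_first(void * const *array, int n)
    {
        for (int i = 0; i < n; ++i)
            if (array[i] != NULL)    // return the first non-NULL entry
                return array[i];
        return NULL;                 // nothing found
    }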

How can OpenGL ES be used for a GPGPU implementation

北战南征 submitted on 2019-12-04 05:59:39
I want to use OpenGL ES for a GPGPU implementation of an image-processing code. I want to know whether I can use OpenGL ES for this purpose, and if I can, which version of OpenGL ES would be more appropriate (OpenGL ES 1.1 or 2.0). OpenGL ES is a graphics technology for embedded systems, and therefore not quite as powerful as its bigger brother. OpenGL ES was not designed with GPGPU processing in mind, but some algorithms, especially those that work on images and require per-pixel processing, can be implemented. However, for real GPGPU programming you should consider OpenCL, Nvidia CUDA, …
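A minimal sketch of the classic GPGPU pattern on OpenGL ES 2.0 that the answer alludes to: upload the image as a texture, render a full-screen quad into a framebuffer object, and let the fragment shader do the per-pixel work. Shader compilation and quad setup are omitted; all names here are illustrative.

    GLuint fbo, outTex;
    glGenTextures(1, &outTex);                       // output image
    glBindTexture(GL_TEXTURE_2D, outTex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glGenFramebuffers(1, &fbo);                      // render target
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, outTex, 0);
    // ... bind the input texture, draw a full-screen quad with the
    //     image-processing fragment shader ...
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, result);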

Set max CUDA resources

被刻印的时光 ゝ submitted on 2019-12-04 05:51:27
Question: I am wondering whether it is possible to set the maximum GPU resources a CUDA application may use. For example, if I had a 4 GB GPU, I might want a given application to only be able to access 2 GB of it, and to fail if it tries to allocate more. Ideally this could be set either at the process level or at the CUDA-context level. Answer 1: No, there are no API, process, or driver controls which allow that kind of resource management at present. Source: https://stackoverflow.com/questions/38427369/set-max-cuda-resources
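The answer above stands; purely as an application-level sketch (not a driver or API control), allocations could be routed through a wrapper that enforces a soft cap. This only constrains code that actually uses the wrapper, and the sketch does not track frees:

    #include <cuda_runtime.h>

    static size_t g_used = 0;
    static const size_t g_cap = (size_t)2 * 1024 * 1024 * 1024;  // 2 GB cap

    cudaError_t cappedMalloc(void **ptr, size_t bytes)
    {
        if (g_used + bytes > g_cap)
            return cudaErrorMemoryAllocation;  // fail past the soft cap
        cudaError_t err = cudaMalloc(ptr, bytes);
        if (err == cudaSuccess)
            g_used += bytes;
        return err;
    }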

OpenGL Compute Shader Invocations

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-04 05:38:15
I have a question related to the new compute shaders. I am currently working on a particle system. I store all my particles in a shader storage buffer so I can access them in the compute shader. Then I dispatch a one-dimensional work group: #define WORK_GROUP_SIZE 128 _shaderManager->useProgram("computeProg"); glDispatchCompute((_numParticles/WORK_GROUP_SIZE), 1, 1); glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT); My compute shader: #version 430 struct particle{ vec4 currentPos; vec4 oldPos; }; layout(std430, binding=0) buffer particles{ struct particle p[]; }; layout (local_size_x = 128, local_size_y …
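One note on the dispatch shown above, as a hedged aside: integer division truncates, so when _numParticles is not a multiple of WORK_GROUP_SIZE the remainder is never processed. A common fix is to round the group count up and guard the excess invocations in the shader (the uniform name in the comment is hypothetical):

    GLuint groups = (_numParticles + WORK_GROUP_SIZE - 1) / WORK_GROUP_SIZE;
    glDispatchCompute(groups, 1, 1);   // now covers the remainder
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    // In the shader, skip the extra invocations, e.g.
    // if (gl_GlobalInvocationID.x >= uNumParticles) return;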

CUDA pow function with integer arguments

元气小坏坏 submitted on 2019-12-04 05:09:30
Question: I'm new to CUDA and cannot understand what I'm doing wrong. I'm trying to calculate, for each object, the distance to the other objects: each object has its id in one array, its x coordinate in another, and its y coordinate in a third, so that I can find the neighbors of each object. __global__ void dist(int *id_d, int *x_d, int *y_d, int *dist_dev, int dimBlock, int i) { int idx = threadIdx.x + blockIdx.x*blockDim.x; while(idx < dimBlock){ int i; for(i= 0; i< dimBlock; i++){ if (idx == i)continue; dist_dev[idx] = pow(x_d[idx] - x_d[i], 2) + pow(y_d[idx] - y_d[i], 2); // error …
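One well-known pitfall visible in this snippet: pow is a floating-point function, so calling it with two int arguments can resolve to a host-only overload in device code, which is a common cause of exactly this kind of error. Since only squared distances are needed, a plain multiply avoids pow entirely. The sketch below reuses the question's names, renames the loop variable so it no longer shadows the i parameter, and assumes a grid-stride increment for the outer loop (the original cuts off before showing one):

    __global__ void dist(int *id_d, int *x_d, int *y_d, int *dist_dev,
                         int dimBlock, int i)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        while (idx < dimBlock) {
            for (int j = 0; j < dimBlock; j++) {
                if (idx == j) continue;
                int dx = x_d[idx] - x_d[j];
                int dy = y_d[idx] - y_d[j];
                dist_dev[idx] = dx * dx + dy * dy;  // integer math, no pow()
            }
            idx += blockDim.x * gridDim.x;          // grid-stride step
        }
    }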

Parameters to CUDA kernels

空扰寡人 submitted on 2019-12-04 03:54:25
When invoking a CUDA kernel with a specific thread configuration, are there any strict rules on which memory space (device/host) kernel parameters should reside in and what type they should be? Suppose I launch a 1-D grid of threads with kernel<<<numblocks, threadsperblock>>>(/*parameters*/). Can I pass an integer parameter int foo, which is a host integer variable, directly to the CUDA kernel? Or should I cudaMalloc memory for a single integer, say dev_foo, then cudaMemcpy foo into dev_foo, and then pass dev_foo as a kernel parameter? Yappie: The rules for kernel arguments are a logical …
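The first option is fine: kernel arguments are passed by value at launch, so a host int can go straight into the argument list; only memory the kernel dereferences has to live on the device. A minimal sketch, with names taken from the question and a placeholder body:

    __global__ void kernel(int foo, const double *dev_data)
    {
        // foo arrives by value; no cudaMalloc/cudaMemcpy needed for it
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // ... use foo and dev_data[idx] ...
    }

    // Host side:
    // int foo = 42;   // ordinary host variable
    // kernel<<<numblocks, threadsperblock>>>(foo, dev_data);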

CUDA Thrust reduction with double2 arrays

烈酒焚心 submitted on 2019-12-04 03:50:16
Question: I have the following (compilable and executable) code, which uses CUDA Thrust to perform reductions of float2 arrays. It works correctly. using namespace std; // includes, system #include <stdlib.h> #include <stdio.h> #include <string.h> #include <math.h> #include <conio.h> #include <typeinfo> #include <iostream> // includes CUDA #include <cuda.h> #include <cuda_runtime.h> // includes Thrust #include <thrust/host_vector.h> #include <thrust/device_vector.h> #include <thrust/reduce.h> // float2 + …
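Since the excerpt cuts off right at the custom float2 operator, here is a minimal sketch of the usual pattern such code continues with: thrust::reduce over a device vector with a user-defined binary functor (d_vec below is an assumed thrust::device_vector<float2>):

    struct add_float2
    {
        __host__ __device__
        float2 operator()(const float2 &a, const float2 &b) const
        {
            return make_float2(a.x + b.x, a.y + b.y);  // element-wise sum
        }
    };

    // float2 init = make_float2(0.0f, 0.0f);
    // float2 sum  = thrust::reduce(d_vec.begin(), d_vec.end(),
    //                              init, add_float2());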

How to check for GPU on CentOS Linux

£可爱£侵袭症+ submitted on 2019-12-04 01:20:27
It is suggested that on Linux the GPU can be found with the command lspci | grep VGA. That works fine on Ubuntu, but when I try the same on CentOS, it says the lspci command is not found. How can I check for the GPU card on CentOS? Note that I'm not the administrator of the machine and I only use it remotely from the command line. I intend to use the GPU as a GPGPU on that machine, but first I need to check whether it even has one. Have you tried launching /sbin/lspci or /usr/sbin/lspci? The following assumes you have the proprietary drivers installed; issue the command nvidia-smi. The output should …

Calculating achieved bandwidth and flops/Gflops, and evaluating CUDA kernel performance

二次信任 submitted on 2019-12-03 22:45:08
Most papers report the flops/Gflops and achieved bandwidth of their CUDA kernels. I have also read answers on Stack Overflow to the following questions: How to evaluate CUDA performance? How Do You Profile & Optimize CUDA Kernels? How to calculate Gflops of a kernel. Counting FLOPS/GFLOPS in program - CUDA. How to calculate the achieved bandwidth of a CUDA kernel. Most of this seems OK, but it still does not make me comfortable calculating these things. Could anyone write a simple CUDA kernel, give the output of deviceQuery, and then compute step by step the flops/Gflops and …
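As a minimal sketch of the arithmetic (not the full worked example the question asks for), take a SAXPY-style kernel: each thread performs 2 floating-point operations and moves 3 * sizeof(float) bytes (two loads, one store), so with N elements and a measured time t in seconds, GFLOP/s = 2N / (t * 1e9) and effective bandwidth = 3 * N * sizeof(float) / (t * 1e9) GB/s. The kernel and timing scaffold below are illustrative:

    __global__ void saxpy(int n, float s, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = s * x[i] + y[i];   // 1 mul + 1 add = 2 FLOPs per element
    }

    // Timing with CUDA events (elapsed time comes back in milliseconds):
    // cudaEvent_t start, stop;  float ms;
    // cudaEventCreate(&start);  cudaEventCreate(&stop);
    // cudaEventRecord(start);
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
    // cudaEventRecord(stop);    cudaEventSynchronize(stop);
    // cudaEventElapsedTime(&ms, start, stop);
    // double gflops = 2.0 * n / ((ms / 1e3) * 1e9);
    // double gbps   = 3.0 * n * sizeof(float) / ((ms / 1e3) * 1e9);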