gpgpu

CUDA: Does passing arguments to a kernel slow the kernel launch much?

寵の児 submitted on 2019-12-04 07:05:56
CUDA beginner here. In my code I am currently launching kernels many times in a loop in the host code (because I need synchronization between blocks), so I wondered whether I could optimize the kernel launch. My kernel launches look something like this: MyKernel<<<blocks,threadsperblock>>>(double_ptr, double_ptr, int N, double x); So to launch a kernel, some signal obviously has to go from the CPU to the GPU, but I'm wondering whether passing the arguments makes this process noticeably slower. The arguments to the kernel are the same every single time, so perhaps I could save time by passing them only once.
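For illustration, a minimal sketch of the idea in the question, with hypothetical names: since the arguments never change, the invariant scalars could be copied once into __constant__ memory before the loop, shrinking the per-launch argument list. Whether that saves measurable time is exactly what the question asks.

    // Sketch under the assumption above; the kernel body is a placeholder.
    __constant__ double c_x;   // set once before the launch loop
    __constant__ int    c_N;

    __global__ void MyKernel(double *a, const double *b)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < c_N)
            a[idx] += b[idx] * c_x;   // placeholder work
    }

    // Host side, once before the loop:
    // cudaMemcpyToSymbol(c_x, &x, sizeof(double));
    // cudaMemcpyToSymbol(c_N, &N, sizeof(int));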

Is there an efficient way to optimize my serialized code?

我只是一个虾纸丫 submitted on 2019-12-04 06:22:24
Question: This question lacks details, so I decided to create another question instead of editing this one. The new question is here: Can I parallelize my code, or is it not worth it? I have a program running in CUDA where one piece of the code runs within a loop (serialized, as you can see below). This piece of code searches within an array that contains addresses and/or NULL pointers. All the threads execute the code below. while (i < n) { if (array[i] != NULL) { return array[i]; } i++; }
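For readability, the same per-thread search as a self-contained device function, assuming array is a device array of n pointers (names are illustrative):

    __device__ void *find_first(void * const *array, int n)
    {
        for (int i = 0; i < n; ++i)
            if (array[i] != NULL)    // return the first non-NULL entry
                return array[i];
        return NULL;                 // nothing found
    }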

How can OpenGL ES be used for a GPGPU implementation

北战南征 submitted on 2019-12-04 05:59:39
I want to use OpenGL ES for a GPGPU implementation of an image-processing code. I want to know whether I can use OpenGL ES for this purpose, and if I can, which version of OpenGL ES would be more appropriate (OpenGL ES 1.1 or 2.0). OpenGL ES is a graphics technology for embedded systems, and therefore not quite as powerful as its bigger brother. OpenGL ES was not designed with GPGPU processing in mind, but some algorithms, especially those that work on images and require per-pixel processing, can be implemented. However, for real GPGPU programming you should consider OpenCL, Nvidia CUDA, …
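A minimal sketch of the classic GPGPU pattern on OpenGL ES 2.0 that the answer alludes to: upload the image as a texture, render a full-screen quad into a framebuffer object, and let the fragment shader do the per-pixel work. Shader compilation and quad setup are omitted; all names here are illustrative.

    GLuint fbo, outTex;
    glGenTextures(1, &outTex);                       // output image
    glBindTexture(GL_TEXTURE_2D, outTex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glGenFramebuffers(1, &fbo);                      // render target
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, outTex, 0);
    // ... bind the input texture, draw a full-screen quad with the
    //     image-processing fragment shader ...
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, result);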

Set max CUDA resources

被刻印的时光 ゝ submitted on 2019-12-04 05:51:27
Question: I am wondering whether it is possible to set the maximum GPU resources a CUDA application may use. For example, if I had a 4 GB GPU, I might want a given application to only be able to access 2 GB of it, and to fail if it tries to allocate more. Ideally this could be set either at the process level or at the CUDA-context level. Answer 1: No, there are no API, process, or driver controls which allow that kind of resource management at present. Source: https://stackoverflow.com/questions/38427369/set-max-cuda-resources
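The answer above stands; purely as an application-level sketch (not a driver or API control), allocations could be routed through a wrapper that enforces a soft cap. This only constrains code that actually uses the wrapper, and the sketch does not track frees:

    #include <cuda_runtime.h>

    static size_t g_used = 0;
    static const size_t g_cap = (size_t)2 * 1024 * 1024 * 1024;  // 2 GB cap

    cudaError_t cappedMalloc(void **ptr, size_t bytes)
    {
        if (g_used + bytes > g_cap)
            return cudaErrorMemoryAllocation;  // fail past the soft cap
        cudaError_t err = cudaMalloc(ptr, bytes);
        if (err == cudaSuccess)
            g_used += bytes;
        return err;
    }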

OpenGL Compute Shader Invocations

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-04 05:38:15
I have a question related to the new compute shaders. I am currently working on a particle system. I store all my particles in a shader storage buffer so I can access them in the compute shader. Then I dispatch a one-dimensional work group: #define WORK_GROUP_SIZE 128 _shaderManager->useProgram("computeProg"); glDispatchCompute((_numParticles/WORK_GROUP_SIZE), 1, 1); glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT); My compute shader: #version 430 struct particle{ vec4 currentPos; vec4 oldPos; }; layout(std430, binding=0) buffer particles{ struct particle p[]; }; layout (local_size_x = 128, local_size_y …
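One note on the dispatch shown above, as a hedged aside: integer division truncates, so when _numParticles is not a multiple of WORK_GROUP_SIZE the remainder is never processed. A common fix is to round the group count up and guard the excess invocations in the shader (the uniform name in the comment is hypothetical):

    GLuint groups = (_numParticles + WORK_GROUP_SIZE - 1) / WORK_GROUP_SIZE;
    glDispatchCompute(groups, 1, 1);   // now covers the remainder
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    // In the shader, skip the extra invocations, e.g.
    // if (gl_GlobalInvocationID.x >= uNumParticles) return;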

CUDA pow function with integer arguments

元气小坏坏 submitted on 2019-12-04 05:09:30
Question: I'm new to CUDA and cannot understand what I'm doing wrong. I'm trying to calculate, for each object, the distance to the other objects: each object has its id in one array, its x coordinate in another, and its y coordinate in a third, so that I can find the neighbors of each object. __global__ void dist(int *id_d, int *x_d, int *y_d, int *dist_dev, int dimBlock, int i) { int idx = threadIdx.x + blockIdx.x*blockDim.x; while(idx < dimBlock){ int i; for(i= 0; i< dimBlock; i++){ if (idx == i)continue; dist_dev[idx] = pow(x_d[idx] - x_d[i], 2) + pow(y_d[idx] - y_d[i], 2); // error …
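One well-known pitfall visible in this snippet: pow is a floating-point function, so calling it with two int arguments can resolve to a host-only overload in device code, which is a common cause of exactly this kind of error. Since only squared distances are needed, a plain multiply avoids pow entirely. The sketch below reuses the question's names, renames the loop variable so it no longer shadows the i parameter, and assumes a grid-stride increment for the outer loop (the original cuts off before showing one):

    __global__ void dist(int *id_d, int *x_d, int *y_d, int *dist_dev,
                         int dimBlock, int i)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        while (idx < dimBlock) {
            for (int j = 0; j < dimBlock; j++) {
                if (idx == j) continue;
                int dx = x_d[idx] - x_d[j];
                int dy = y_d[idx] - y_d[j];
                dist_dev[idx] = dx * dx + dy * dy;  // integer math, no pow()
            }
            idx += blockDim.x * gridDim.x;          // grid-stride step
        }
    }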

Parameters to CUDA kernels

空扰寡人 submitted on 2019-12-04 03:54:25
When invoking a CUDA kernel with a specific thread configuration, are there any strict rules on which memory space (device/host) kernel parameters should reside in and what type they should be? Suppose I launch a 1-D grid of threads with kernel<<<numblocks, threadsperblock>>>(/*parameters*/). Can I pass an integer parameter int foo, which is a host integer variable, directly to the CUDA kernel? Or should I cudaMalloc memory for a single integer, say dev_foo, then cudaMemcpy foo into dev_foo, and then pass dev_foo as a kernel parameter? Yappie: The rules for kernel arguments are a logical …
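The first option is fine: kernel arguments are passed by value at launch, so a host int can go straight into the argument list; only memory the kernel dereferences has to live on the device. A minimal sketch, with names taken from the question and a placeholder body:

    __global__ void kernel(int foo, const double *dev_data)
    {
        // foo arrives by value; no cudaMalloc/cudaMemcpy needed for it
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // ... use foo and dev_data[idx] ...
    }

    // Host side:
    // int foo = 42;   // ordinary host variable
    // kernel<<<numblocks, threadsperblock>>>(foo, dev_data);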

CUDA Thrust reduction with double2 arrays

烈酒焚心 submitted on 2019-12-04 03:50:16
Question: I have the following (compilable and executable) code, which uses CUDA Thrust to perform reductions of float2 arrays. It works correctly. using namespace std; // includes, system #include <stdlib.h> #include <stdio.h> #include <string.h> #include <math.h> #include <conio.h> #include <typeinfo> #include <iostream> // includes CUDA #include <cuda.h> #include <cuda_runtime.h> // includes Thrust #include <thrust/host_vector.h> #include <thrust/device_vector.h> #include <thrust/reduce.h> // float2 + …
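Since the excerpt cuts off right at the custom float2 operator, here is a minimal sketch of the usual pattern such code continues with: thrust::reduce over a device vector with a user-defined binary functor (d_vec below is an assumed thrust::device_vector<float2>):

    struct add_float2
    {
        __host__ __device__
        float2 operator()(const float2 &a, const float2 &b) const
        {
            return make_float2(a.x + b.x, a.y + b.y);  // element-wise sum
        }
    };

    // float2 init = make_float2(0.0f, 0.0f);
    // float2 sum  = thrust::reduce(d_vec.begin(), d_vec.end(),
    //                              init, add_float2());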

How to check for GPU on CentOS Linux

£可爱£侵袭症+ submitted on 2019-12-04 01:20:27
It is suggested that on Linux the GPU can be found with the command lspci | grep VGA. That works fine on Ubuntu, but when I try the same on CentOS, it says the lspci command is not found. How can I check for the GPU card on CentOS? Note that I'm not the administrator of the machine and I only use it remotely from the command line. I intend to use the GPU as a GPGPU on that machine, but first I need to check whether it even has one. Have you tried launching /sbin/lspci or /usr/sbin/lspci? The following assumes you have the proprietary drivers installed; issue the command nvidia-smi. The output should …

Calculating achieved bandwidth and flops/Gflops, and evaluating CUDA kernel performance

二次信任 submitted on 2019-12-03 22:45:08
Most papers report the flops/Gflops and achieved bandwidth of their CUDA kernels. I have also read answers on Stack Overflow to the following questions: How to evaluate CUDA performance? How Do You Profile & Optimize CUDA Kernels? How to calculate Gflops of a kernel. Counting FLOPS/GFLOPS in program - CUDA. How to calculate the achieved bandwidth of a CUDA kernel. Most of this seems OK, but it still does not make me comfortable calculating these things. Could anyone write a simple CUDA kernel, give the output of deviceQuery, and then compute step by step the flops/Gflops and …
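As a minimal sketch of the arithmetic (not the full worked example the question asks for), take a SAXPY-style kernel: each thread performs 2 floating-point operations and moves 3 * sizeof(float) bytes (two loads, one store), so with N elements and a measured time t in seconds, GFLOP/s = 2N / (t * 1e9) and effective bandwidth = 3 * N * sizeof(float) / (t * 1e9) GB/s. The kernel and timing scaffold below are illustrative:

    __global__ void saxpy(int n, float s, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = s * x[i] + y[i];   // 1 mul + 1 add = 2 FLOPs per element
    }

    // Timing with CUDA events (elapsed time comes back in milliseconds):
    // cudaEvent_t start, stop;  float ms;
    // cudaEventCreate(&start);  cudaEventCreate(&stop);
    // cudaEventRecord(start);
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
    // cudaEventRecord(stop);    cudaEventSynchronize(stop);
    // cudaEventElapsedTime(&ms, start, stop);
    // double gflops = 2.0 * n / ((ms / 1e3) * 1e9);
    // double gbps   = 3.0 * n * sizeof(float) / ((ms / 1e3) * 1e9);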