gpgpu

Should I look into PTX to optimize my kernel? If so, how?

Submitted by 筅森魡賤 on 2019-12-12 10:37:03
Question: Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further? One example: I read that one can tell from the PTX code whether the automatic loop unrolling worked. If it did not, one would have to unroll the loops manually in the kernel code. Are there other use cases for the PTX code? Do you look into your PTX code? Where can I find out how to read the PTX code CUDA generates for my kernels? Answer 1: The first point to make about PTX is that it …
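The answer presumably goes on to cover this, but for readers who want to try it themselves: below is a minimal, illustrative sketch (the kernel name and sizes are made up) of requesting unrolling and then checking the generated PTX. Human-readable PTX can be produced with nvcc -ptx; a fully unrolled loop shows up as a repeated straight-line instruction sequence instead of a loop body with a backward branch.

    // Illustrative kernel: ask the compiler to unroll, then inspect the PTX.
    __global__ void windowSum(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i + 4 <= n) {
            float acc = 0.0f;
            #pragma unroll   // small fixed trip count: a candidate for full unrolling
            for (int k = 0; k < 4; ++k)
                acc += in[i + k];
            out[i] = acc;
        }
    }
    // Build PTX:  nvcc -ptx kernel.cu -o kernel.ptx
    // In kernel.ptx, look for four consecutive ld.global/add.f32 pairs
    // rather than a setp/bra loop around a single pair.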

SIMD-8, SIMD-16 or SIMD-32 in OpenCL on GPGPU

Submitted by 喜欢而已 on 2019-12-12 10:13:42
Question: I read a couple of questions on SO about this topic (SIMD mode), but some clarification/confirmation of how things work is still required: Why use SIMD if we have GPGPU? SIMD intrinsics - are they usable on GPUs? CPU SIMD vs GPU SIMD? Are the following points correct if I compile the code in SIMD-8 mode? 1) It means 8 instructions of different work items are executed in parallel. 2) Does it mean all work items are executing the same instruction only? 3) If each work item's code contains …
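One way to poke at this empirically, offered only as an illustration (device and platform discovery are elided, and print_simd_hints is a made-up helper name): the core OpenCL host API reports per-work-item vector-width hints, and on most GPUs these are 1, because SIMD-8/16/32 describes how many work items the hardware packs onto one SIMD unit, not vector lanes inside a single work item.

    /* Minimal host-side sketch using the core OpenCL C API. */
    #include <CL/cl.h>
    #include <stdio.h>

    void print_simd_hints(cl_device_id dev)
    {
        cl_uint preferred = 0, native = 0;
        clGetDeviceInfo(dev, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                        sizeof(preferred), &preferred, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT,
                        sizeof(native), &native, NULL);
        /* Values of 1 mean scalar code per work item; the SIMD grouping
           happens across work items in a sub-group/warp/wavefront. */
        printf("preferred float width: %u, native float width: %u\n",
               preferred, native);
    }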

clGetDeviceInfo and clGetPlatformInfo fail in OpenCL with error code -30 (CL_INVALID_VALUE)

Submitted by 给你一囗甜甜゛ on 2019-12-12 04:54:52
Question: I am starting to write a little "engine" for using OpenCL. Now I have encountered a problem that is quite strange. When I call clGetDeviceInfo() to query information about a specific device, some of the options for the parameter param_name return the error code -30 (= CL_INVALID_VALUE). A famous one is the option CL_DEVICE_EXTENSIONS, which should return a string of extensions no matter what SDK or platform I am using. I checked every edge case, and the parameters are double-checked.
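A common cause of CL_INVALID_VALUE here is a param_value_size smaller than the string the driver wants to write. Below is a sketch of the usual defensive pattern (error handling abbreviated, print_extensions is an illustrative name): query the required size first, then allocate exactly that much and fetch the value.

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    void print_extensions(cl_device_id dev)
    {
        /* Step 1: ask only for the required buffer size. */
        size_t size = 0;
        cl_int err = clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, 0, NULL, &size);
        if (err != CL_SUCCESS) { fprintf(stderr, "size query: %d\n", err); return; }

        /* Step 2: allocate that much and fetch the string itself. */
        char *ext = (char *)malloc(size);
        err = clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, size, ext, NULL);
        if (err == CL_SUCCESS)
            printf("extensions: %s\n", ext);
        free(ext);
    }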

CUDA to solve many “small/moderate” linear systems

Submitted by 北城余情 on 2019-12-12 04:24:33
Question: Some background on the problem I am trying to speed up using CUDA: I have a large number of small/moderate same-sized linear systems that I need to solve independently. Each linear system is square, real, dense, invertible, and non-symmetric. These are actually matrix systems, so each system looks like AX = B, where A, X, and B are (n x n) matrices. In this previous question, CUBLAS batch and matrix sizes, I learned that cuBLAS batched operations give the best performance for matrices of size …
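For readers landing here, a hedged sketch of the standard batched route for this shape of problem: factor every A at once with cuBLAS batched LU, then solve with the batched solve. d_Aptrs and d_Bptrs are assumed to be device arrays of device pointers, one (n x n) matrix per system, already populated; allocation, upload, and error checks are elided.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    void solveBatched(cublasHandle_t handle, float **d_Aptrs, float **d_Bptrs,
                      int n, int batch)
    {
        int *d_pivots, *d_infoLU, h_infoSolve = 0;
        cudaMalloc(&d_pivots, sizeof(int) * n * batch);
        cudaMalloc(&d_infoLU, sizeof(int) * batch);

        // In-place LU factorization of every A in the batch.
        cublasSgetrfBatched(handle, n, d_Aptrs, n, d_pivots, d_infoLU, batch);

        // Solve AX = B for every system; each B (n x n, so nrhs = n) is
        // overwritten with its X. Note getrsBatched takes a host info pointer.
        cublasSgetrsBatched(handle, CUBLAS_OP_N, n, n,
                            (const float * const *)d_Aptrs, n, d_pivots,
                            d_Bptrs, n, &h_infoSolve, batch);

        cudaFree(d_pivots);
        cudaFree(d_infoLU);
    }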

Computing the mean of 2000 2D-arrays with CUDA C

Submitted by 末鹿安然 on 2019-12-12 03:59:09
Question: I have 2000 2D arrays (each array is 1000x1000). I need to compute the mean of each one and put the results in one 2000-element vector. I tried to do that by calling the kernel once per 2D array, but that is naive; I want to do the computation in one launch. What I have done so far is a kernel for a single 2D array; I want a kernel that does this for all 2000 2D arrays at once. #include <stdio.h> #include <cuda.h> #include <time.h> void init_mat(float *a, const int N, const int M); void print_mat(float *a, const int N, …
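The usual single-launch shape for this is one block per array: launch 2000 blocks and let each block reduce its own 1000x1000 array with a strided loop plus a shared-memory tree reduction. A hedged, illustrative kernel (not the poster's code; the names follow the question, and the block size of 256 must match the launch configuration):

    #include <cuda_runtime.h>

    #define N 1000
    #define M 1000

    __global__ void mean_per_array(const float *arrays, float *means)
    {
        __shared__ float partial[256];
        const float *a = arrays + (size_t)blockIdx.x * N * M;

        // Each thread accumulates a strided slice of this block's array.
        float sum = 0.0f;
        for (int i = threadIdx.x; i < N * M; i += blockDim.x)
            sum += a[i];
        partial[threadIdx.x] = sum;
        __syncthreads();

        // Standard tree reduction in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            means[blockIdx.x] = partial[0] / (float)(N * M);
    }
    // Launch: mean_per_array<<<2000, 256>>>(d_arrays, d_means);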

Finding the Minimum Value in an Array and Its Index Using the CUDA __shfl_down Function

Submitted by 瘦欲@ on 2019-12-12 03:38:07
Question: I am writing a function that finds the minimum value in a 1D array, and the index at which that value was found, using CUDA. I started by modifying the reduction code for finding the sum of the values in a 1D array. The code works fine for the sum function, but I cannot get it to work for finding the minimum. I am attaching the code in the message; if there is any CUDA guru around, please point out the mistake I am making. The actual function is below, and in the test example the array size is 1024. So, I am using shuffle …
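The standard trick, sketched below, is to shuffle the value and its index down together and keep whichever pair has the smaller value; with a sum no pairing is needed, which is why the sum version was easier to adapt. The question's era used __shfl_down; this sketch uses the CUDA 9+ __shfl_down_sync, but the idea is identical.

    #include <cuda_runtime.h>

    __inline__ __device__ void warpMinIdx(float &val, int &idx)
    {
        for (int offset = warpSize / 2; offset > 0; offset /= 2) {
            float otherVal = __shfl_down_sync(0xffffffff, val, offset);
            int   otherIdx = __shfl_down_sync(0xffffffff, idx, offset);
            if (otherVal < val) {   // keep the smaller value AND its index
                val = otherVal;
                idx = otherIdx;
            }
        }
        // After the loop, lane 0 holds the warp's minimum and its index.
    }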

Matlab GPU arrayfun shared variable

Submitted by  ̄綄美尐妖づ on 2019-12-12 03:30:43
Question: I am using MATLAB GPU computing with the function arrayfun and a gpuArray object to apply an element-wise function to the elements of the gpuArray variable in my function: function [ output ] = MyFunc( element, SharedMatrix ) % Process element with SharedMatrix end and my code is like so: SharedMatrix = magic(5000); % large in-memory object SharedMatrix = gpuArray(SharedMatrix); elements = magic(5); gpuElements = gpuArray(elements); % Error on next line: the SharedMatrix object must be a scalar. result = arrayfun( …

CUDA Warps and Optimal Number of Threads Per Block

Submitted by 白昼怎懂夜的黑 on 2019-12-12 03:26:23
Question: From what I understand about Kepler GPUs, and CUDA in general, when a single SMX unit works on a block, it launches warps, which are groups of 32 threads. Now here are my questions: 1) If the SMX unit can work on 64 warps, that means there is a limit of 32 x 64 = 2048 threads per SMX unit. But Kepler GPUs have 4 warp schedulers, so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel? And if so, does this mean I should really be looking for blocks that …
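As a companion to the question, a small hedged sketch using the CUDA runtime occupancy API (myKernel is a placeholder). The 4 schedulers issue from at most 4 of the resident warps each cycle, but all resident warps, up to 64 per SMX on Kepler, stay in flight to hide latency, which is why block sizes are normally chosen for occupancy rather than to match the scheduler count.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void myKernel(float *data) { /* placeholder body */ }

    int main()
    {
        int blockSize = 256;        // candidate block size to evaluate
        int maxBlocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &maxBlocksPerSM, myKernel, blockSize, 0 /* dynamic smem */);
        printf("resident warps per SM: %d (Kepler max: 64)\n",
               maxBlocksPerSM * blockSize / 32);
        return 0;
    }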

Information on current GPU Architectures

Submitted by [亡魂溺海] on 2019-12-12 02:27:48
Question: I have decided that my bachelor's thesis will be about general-purpose GPU computing and which problems are better suited to it than others. I am also trying to find out whether there are any major differences between the current GPU architectures that may affect this. I am currently looking for scientific papers and/or information directly from the manufacturers about the current GPU architectures, but I can't seem to find anything that looks detailed enough. Therefore, I am hoping that …

Using CUDA Profiler nvprof for memory accesses

Submitted by 强颜欢笑 on 2019-12-12 01:59:21
Question: I'm using nvprof to get the number of global memory accesses for the following CUDA code. The number of loads in the kernel is 36 (accessing the d_In array) and the number of stores in the kernel is 36 + 36 (for accessing the d_Out and d_rows arrays). So, the total number of global memory loads is 36 and the number of global memory stores is 72. However, when I profile the code with the nvprof CUDA profiler, it reports the following: (basically, I want to compute the compute-to-global-memory-access …
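Part of the likely answer: nvprof's memory metrics count hardware transactions, not source-level accesses, so one warp-level load can become anywhere from one to 32 transactions depending on coalescing, and the source-level counts of 36 and 72 will rarely match the profiler's numbers directly. A hedged example invocation (metric names from nvprof's standard metric set; ./app is a placeholder for the profiled binary):

    nvprof --metrics gld_transactions,gst_transactions ./app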