gpu-programming

GPU gives no performance improvement in Julia set computation

Submitted by 北战南征 on 2019-12-11 03:35:35
Question: I am trying to compare CPU and GPU performance. CPU: Intel® Core™ i5 CPU M 480 @ 2.67GHz × 4. GPU: NVIDIA GeForce GT 420M. I can confirm that the GPU is configured and works correctly with CUDA. I am implementing the Julia set computation: http://en.wikipedia.org/wiki/Julia_set Basically, for every pixel, if the coordinate is in the set it is painted red, otherwise white. Although I get identical answers with both the CPU and the GPU, instead of getting a performance improvement, I get a…
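The question does not include the asker's kernel, but the per-pixel structure of this computation usually looks like the sketch below, in the style of the classic CUDA-by-Example layout. DIM, the iteration limit, the escape radius, and the constant c are illustrative assumptions, not taken from the question. Notably, a one-thread-per-block launch like this one leaves most of each SM idle, which is one plausible reason a low-end GT 420M shows no speedup over the CPU.

```cuda
#include <cuda_runtime.h>

#define DIM 1000  // image is DIM x DIM pixels (assumed size)

// Returns 1 if the point stays bounded under z -> z^2 + c (in the set).
__device__ int julia(int x, int y) {
    const float scale = 1.5f;
    float jx = scale * (float)(DIM / 2 - x) / (DIM / 2);
    float jy = scale * (float)(DIM / 2 - y) / (DIM / 2);
    float cr = -0.8f, ci = 0.156f;   // the constant c of the Julia map
    float zr = jx, zi = jy;
    for (int i = 0; i < 200; i++) {
        float nr = zr * zr - zi * zi + cr;   // Re(z^2 + c)
        float ni = 2.0f * zr * zi + ci;      // Im(z^2 + c)
        zr = nr; zi = ni;
        if (zr * zr + zi * zi > 1000.0f) return 0;  // escaped: not in set
    }
    return 1;
}

// Launch as: juliaKernel<<<dim3(DIM, DIM), 1>>>(devPtr);
// One block per pixel with a single thread is simple but wastes the SMs.
__global__ void juliaKernel(unsigned char *ptr) {
    int x = blockIdx.x, y = blockIdx.y;
    int offset = x + y * gridDim.x;
    int v = julia(x, y);
    ptr[4 * offset + 0] = 255 * v;  // red if in the set, white otherwise
    ptr[4 * offset + 1] = 255 * (1 - v);
    ptr[4 * offset + 2] = 255 * (1 - v);
    ptr[4 * offset + 3] = 255;
}
```

A launch shape of `<<<dim3(DIM/16, DIM/16), dim3(16, 16)>>>` with the pixel index derived from both block and thread indices would keep the GPU far busier.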

CUDA-transfer 2D array from host to device

Submitted by 南楼画角 on 2019-12-10 18:33:55
Question: I have a 2D matrix in main. I want to transfer it from the host to the device. Can you tell me how I can allocate memory for it and transfer it to device memory?

#define N 5
__global__ void kernel(int a[N][N]){ }
int main(void){
    int a[N][N];
    cudaMalloc(?);
    cudaMemcpy(?);
    kernel<<<N,N>>>(?);
}

Answer 1: Perhaps something like this is what you really had in mind:

#define N 5
__global__ void kernel(int *a)
{
    // Thread indexing within grid - note these are
    // in column major order.
    int tidx =…
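The answer excerpt is cut off, but the key observation it builds on is that a statically sized `int a[N][N]` is one contiguous block of `N*N` ints, so a single allocation and a single copy suffice. A complete sketch along those lines (the `+= 1` body and the one-block-per-row launch shape are illustrative assumptions, not from the question):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 5

// Kernel receives a flat pointer; row i, column j lives at a[i * N + j].
__global__ void kernel(int *a) {
    int i = blockIdx.x;    // one block per row (illustrative launch shape)
    int j = threadIdx.x;   // one thread per column
    if (i < N && j < N)
        a[i * N + j] += 1;
}

int main(void) {
    int a[N][N] = {{0}};
    int *d_a;

    // A static 2D array is contiguous, so one cudaMalloc/cudaMemcpy
    // of N * N * sizeof(int) moves the whole matrix.
    cudaMalloc((void **)&d_a, N * N * sizeof(int));
    cudaMemcpy(d_a, a, N * N * sizeof(int), cudaMemcpyHostToDevice);

    kernel<<<N, N>>>(d_a);

    cudaMemcpy(a, d_a, N * N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    printf("a[0][0] = %d\n", a[0][0]);
    return 0;
}
```

For dynamically sized matrices, `cudaMallocPitch`/`cudaMemcpy2D` is the usual alternative.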

Run OpenCL program on NVIDIA hardware

Submitted by ↘锁芯ラ on 2019-12-10 17:45:04
Question: I've built a simple OpenCL-based program (in C++) and tested it on a Windows 8 system with an AMD FirePro V4900 card, using the AMD APP SDK. When I copy my binaries to another machine (Windows 8 with an NVIDIA Quadro 4000 card) I get "The procedure entry point clReleaseDevice could not be located in the dynamic link library (exe of my program)". This second machine has the latest NVIDIA drivers and CUDA 5 installed. Any ideas on what I need to do to make it work with NVIDIA hardware? Answer 1: It's an…

Is there really a timeout for kernels on NVIDIA GPUs?

Submitted by 感情迁移 on 2019-12-10 16:59:02
Question: While searching for answers as to why my kernels produce strange error messages or only "0" results, I found an answer on SO mentioning that there is a timeout of 5 s for kernels running on NVIDIA GPUs. I googled for the timeout but could not find confirming sources or more information. What do you know about it? Could the timeout cause strange behaviour in kernels with a long runtime? Thanks! Answer 1: Further googling brought this up in the CUDA_Toolkit_Release_Notes_Linux.txt (Known Issues): #…
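The timeout in question is the OS display watchdog: on a GPU that is driving a display, the operating system kills kernels that run longer than a few seconds, which does produce errors or all-zero results. Whether the watchdog applies can be checked per device; a small sketch (printed wording is mine):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // kernelExecTimeoutEnabled is nonzero when the OS watchdog
        // applies to kernels launched on this device.
        printf("device %d (%s): run-time limit on kernels: %s\n",
               d, prop.name,
               prop.kernelExecTimeoutEnabled ? "yes" : "no");
    }
    return 0;
}
```

The usual workarounds are running on a GPU with no display attached, or splitting long kernels into shorter launches.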

A question about how blocks are distributed to SMs in CUDA

Submitted by ぃ、小莉子 on 2019-12-10 10:44:24
Question: Let me take hardware with compute capability 1.3 as an example. 30 SMs are available, so at most 240 blocks can be running at the same time (considering the limits on registers and shared memory, the restriction on the number of blocks may be much lower). Blocks beyond those 240 have to wait for hardware resources to become available. My question is when those blocks beyond 240 will be assigned to SMs: once some of the first 240 blocks are completed, or when all of the first 240 blocks are…

CUDA Visual Studio 2010 Express build error

Submitted by 对着背影说爱祢 on 2019-12-10 08:14:37
Question: I am trying to get started with CUDA programming on Windows using Visual Studio 2010 Express on 64-bit Windows 7. It took me a while to set up the environment, and I have just written my first program, helloWorld.cu :) Currently I am working with the following program:

#include <stdio.h>
__global__ void add(int a, int b, int *c){
    *c = a + b;
}
int main(void){
    int c;
    int *dev_c;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, sizeof(int) ) );
    add<<<1,1>>>(2, 7, dev_c);
    HANDLE_ERROR( cudaMemcpy( &c, dev…
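The excerpt cuts off mid-call. For reference, a complete version of the same program is sketched below; `HANDLE_ERROR` is not a CUDA API but the helper macro from the "CUDA by Example" book, and its body here is an assumption reproduced in that style:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Error-checking helper in the style of "CUDA by Example";
// the macro name matches the question but the body is an assumption.
static void HandleError(cudaError_t err, const char *file, int line) {
    if (err != cudaSuccess) {
        printf("%s in %s at line %d\n", cudaGetErrorString(err), file, line);
        exit(EXIT_FAILURE);
    }
}
#define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))

__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main(void) {
    int c;
    int *dev_c;
    HANDLE_ERROR(cudaMalloc((void **)&dev_c, sizeof(int)));
    add<<<1, 1>>>(2, 7, dev_c);
    HANDLE_ERROR(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost));
    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);
    return 0;
}
```

A common cause of build errors with this program in Visual Studio 2010 Express is compiling the file as .cpp with cl.exe instead of routing .cu files through nvcc via the CUDA build customization.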

GPU Programming, CUDA or OpenCL? [closed]

Submitted by 北战南征 on 2019-12-08 22:38:51
Question: Closed. This question needs to be more focused and is not currently accepting answers. Closed 4 years ago. I am a newbie to GPU programming. I have a laptop with an NVIDIA GeForce GT 640 card. I am faced with two dilemmas; suggestions are most welcome. If I go for CUDA: Ubuntu or Windows? Clearly CUDA is more suitable on Windows, while it can be a severe issue to install on Ubuntu. I have…

How to measure GPU vs CPU performance? Which time measurement functions?

Submitted by 有些话、适合烂在心里 on 2019-12-08 19:36:29
Question: What libraries or functions should be used for an objective comparison of CPU and GPU performance? What caveats should be kept in mind for an accurate evaluation? I am using an Ubuntu platform with a device of compute capability 2.1, working with the CUDA 5 toolkit. Answer 1: I'm using the following. CPU - return microseconds between tic and toc with 2 microseconds of resolution:

#include <sys/time.h>
#include <time.h>
struct timespec init;
struct timespec after;
void tic() {
    clock_gettime…

Inter-block synchronization in CUDA

Submitted by 心已入冬 on 2019-12-08 09:36:49
Question: I've been searching for a month on this problem: I cannot synchronize blocks in CUDA. I've read a lot of posts about atomicAdd, cooperative groups, etc. I decided to use a global array so that each block could write to one element of it. After this write, a thread of the block waits (i.e., spins in a while loop) until all blocks have written to the global array. When I used 3 blocks my synchronization worked well (because I have 3 SMs). But using 3 blocks gives me 12% occupancy, so I need to use more blocks, but…
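Spinning on a global array deadlocks whenever not all blocks are resident at once: a non-resident block can never write its flag, so the resident blocks spin forever. That is why the scheme only worked with 3 blocks on 3 SMs. The supported route (compute capability 6.0+, CUDA 9+, launched with `cudaLaunchCooperativeKernel` so residency of all blocks is guaranteed) is a cooperative-groups grid sync; a minimal sketch (the two-phase kernel body is illustrative):

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void twoPhase(float *data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) data[i] *= 2.0f;     // phase 1: every block writes

    grid.sync();                    // all blocks in the grid reach here first

    if (i < n) data[i] += data[0];  // phase 2 may safely read phase-1 results
}

// Host side: a cooperative launch is required for grid.sync() to be legal.
// void *args[] = { &d_data, &n };
// cudaLaunchCooperativeKernel((void *)twoPhase, gridDim, blockDim, args);
```

If the hardware predates compute capability 6.0, the reliable alternative is splitting the work into two kernel launches, using the implicit device-wide synchronization between them.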

Simulating pipeline program with CUDA

Submitted by 对着背影说爱祢 on 2019-12-08 04:45:09
Question: Say I have two arrays A and B, and a kernel1 that does some calculation on both arrays (vector addition, for example) by breaking the arrays into chunks and writing the partial result to C. kernel1 keeps doing this until all elements in the arrays are processed.

unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int gridSize = blockDim.x*gridDim.x;
// iterate through each chunk of gridSize in both A and B
while (i < N) {
    C[i] = A[i] + B[i];
    i += gridSize;
}

Say,…
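The snippet above is a grid-stride loop inside a single kernel. If the goal is a pipeline, the usual CUDA idiom is instead to issue chunked work on multiple streams, so the host-to-device copy of chunk k+1 overlaps the kernel for chunk k. A hedged sketch (chunk size, stream count, and function names are assumptions; host buffers must be pinned with `cudaMallocHost` for the async copies to actually overlap):

```cuda
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Process N elements in CHUNK-sized pieces on two streams so that
// copies and kernels from different chunks overlap.
void pipeline(const float *hA, const float *hB, float *hC,
              float *dA, float *dB, float *dC, int N) {
    const int CHUNK = 1 << 20;
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);

    for (int off = 0, k = 0; off < N; off += CHUNK, k++) {
        int n = (N - off < CHUNK) ? (N - off) : CHUNK;
        cudaStream_t st = s[k % 2];  // alternate streams per chunk
        cudaMemcpyAsync(dA + off, hA + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        cudaMemcpyAsync(dB + off, hB + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        vecAdd<<<(n + 255) / 256, 256, 0, st>>>(dA + off, dB + off,
                                                dC + off, n);
        cudaMemcpyAsync(hC + off, dC + off, n * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    for (int i = 0; i < 2; i++) {
        cudaStreamSynchronize(s[i]);
        cudaStreamDestroy(s[i]);
    }
}
```

Work queued on the same stream runs in issue order, which gives each chunk its copy-compute-copy dependency for free; overlap comes from interleaving the two streams.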