gpu-programming | 易学教程

Sum a variable over all threads in a CUDA Kernel and return it to Host

Question: I am new to CUDA and I'm trying to implement a kernel to calculate the energy of my Metropolis Monte Carlo simulation. Here is the serial version of this function: float calc_energy(struct frame frm, float L, float rc){ int i,j; float E=0, rij, dx, dy, dz; for(i=0; i<frm.natm; i++) { for(j=i+1; j<frm.natm; j++) { dx = fabs(frm.conf[j][0] - frm.conf[i][0]); dy = fabs(frm.conf[j][1] - frm.conf[i][1]); dz = fabs(frm.conf[j][2] - frm.conf[i][2]); dx = dx - round(dx/L)*L; dy = dy - round(dy/L)*L;
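One common way to sum a per-thread value and return it to the host is to have every thread add its contribution to a single accumulator with atomicAdd, then copy that accumulator back. The sketch below is only an illustration under assumptions: the names sum_contributions and sum_on_device are hypothetical, and the input array stands in for whatever per-thread energy term a real kernel would compute. It is not the asker's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one value to a single accumulator in global memory.
// atomicAdd on float requires compute capability 2.0 or higher.
__global__ void sum_contributions(const float* values, int n, float* total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(total, values[i]);
    }
}

// Hypothetical host wrapper: copies the inputs to the device, launches the
// kernel, and returns the accumulated sum (or -1.0f if a CUDA call fails).
float sum_on_device(const float* h_values, int n) {
    if (n <= 0) return 0.0f;

    float* d_values = nullptr;
    float* d_total = nullptr;
    float h_total = -1.0f;

    if (cudaMalloc((void**)&d_values, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void**)&d_total, sizeof(float)) != cudaSuccess) {
        cudaFree(d_values);
        return -1.0f;
    }

    cudaMemcpy(d_values, h_values, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(float));  // all-zero bytes represent 0.0f

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n values
    sum_contributions<<<blocks, threads>>>(d_values, n, d_total);

    // This blocking copy waits for the kernel to finish before reading the result.
    if (cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess) {
        h_total = -1.0f;
    }

    cudaFree(d_values);
    cudaFree(d_total);
    return h_total;
}
```

Calling sum_on_device(values, n) from host code returns the total. With many threads, a shared-memory reduction per block followed by one atomicAdd per block usually reduces contention; the single-accumulator form is shown here only because it is the shortest.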

CUDA FFT exception

Question: I'm trying to use the CUDA FFT library (cuFFT). The problem occurs when cufftPlan1d(..) throws an exception. #define NX 256 #define BATCH 10 cufftHandle plan; cufftComplex *data; cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH); if (cudaGetLastError() != cudaSuccess){ fprintf(stderr, "Cuda error: Failed to allocate\n"); return; } if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: Plan creation failed"); return; } When the compiler hit the

Matrix Multiplication giving wrong output [duplicate]

Question: This question already has an answer here: Unable to execute device kernel in CUDA (1 answer). Closed 4 years ago. What I am attempting to do is multiply matrix A by matrix B and then, from the product matrix, get the index of the maximum value per column. Unfortunately, only the first 128*128 values of the matrix multiplication are correct, while the others are just garbage. I do not quite understand how this works. Please guide me with this. #include<stdio.h> #include "cuda

OpenCL : Querying max clock frequency of a mobile GPU always returns a lesser value

Question: To find the max clock frequency of a Mali T760 GPU, I used the code snippet below: // Get device max clock frequency cl_uint max_clock_freq; err_num = clGetDeviceInfo(cl_devices[device_idx], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(max_clock_freq), &max_clock_freq, NULL); check_cl_error(err_num, "clGetDeviceInfo: Getting device max clock frequency"); printf("CL_DEVICE_MAX_CLOCK_FREQUENCY: %d MHz\n", max_clock_freq); Full source code available here: https://github.com/sivagnanamn/opencl

How to use GPU for mathematics [closed]

Question: Closed 2 years ago. I am looking at using the GPU for crunching some equations but cannot figure out how to access it from C#. I know that the XNA and DirectX frameworks allow you to use shaders to access the GPU, but how would I go about accessing it without these frameworks? Answer 1: I

Unable to execute device kernel in CUDA

Question: I am trying to call a device kernel within a global kernel. My global kernel is a matrix multiplication, and my device kernel finds the maximum value and its index in each column of the product matrix. The code follows: __device__ void MaxFunction(float* Pd, float* max) { int x = (threadIdx.x + blockIdx.x * blockDim.x); int y = (threadIdx.y + blockIdx.y * blockDim.y); int k = 0; int temp = 0; int temp_idx = 0; for (k = 0; k < wB; ++k) { if(Pd[x*wB + y] > temp){ temp = Pd[x*wB + y];

printf inside CUDA __global__ function

Question: I am currently writing a matrix multiplication on a GPU and would like to debug my code, but since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function: __global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){ int tx = threadIdx.x; int ty = threadIdx.y; int bx = blockIdx.x; int by = blockIdx.y; float sum = 0; for( int k = 0; k < Ad.width ; ++k){ float Melement = Ad.elements[ty * Ad.width
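For context, device-side printf is available on GPUs of compute capability 2.0 or higher, so on such hardware a kernel can print directly. The sketch below is a minimal illustration and not the asker's kernel: debug_kernel and run_debug_kernel are hypothetical names, and the 2x2 block size is an arbitrary choice to keep the output short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread prints its own block and thread indices.
// Device-side printf requires compute capability 2.0 or higher.
__global__ void debug_kernel() {
    printf("block (%u,%u) thread (%u,%u)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

// Hypothetical host helper: launches one 2x2 block and waits for it.
// The device printf buffer is flushed at synchronization points such as
// cudaDeviceSynchronize, so the output appears once this call returns.
cudaError_t run_debug_kernel() {
    dim3 grid(1, 1);
    dim3 block(2, 2);
    debug_kernel<<<grid, block>>>();
    return cudaDeviceSynchronize();
}
```

The order in which the four lines appear is not guaranteed, since threads are not scheduled in index order. Without a synchronizing call after the launch, the program may exit before any output is shown.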
