gpu-programming

Sum a variable over all threads in a CUDA Kernel and return it to Host

末鹿安然 submitted on 2020-06-09 05:18:04
Question: I am new to CUDA and I'm trying to implement a kernel to calculate the energy of my Metropolis Monte Carlo simulation. Here is the serial version of the function:
    float calc_energy(struct frame frm, float L, float rc){
        int i,j;
        float E=0, rij, dx, dy, dz;
        for(i=0; i<frm.natm; i++) {
            for(j=i+1; j<frm.natm; j++) {
                dx = fabs(frm.conf[j][0] - frm.conf[i][0]);
                dy = fabs(frm.conf[j][1] - frm.conf[i][1]);
                dz = fabs(frm.conf[j][2] - frm.conf[i][2]);
                dx = dx - round(dx/L)*L;
                dy = dy - round(dy/L)*L;
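One common way to sum a per-thread value and return it to the host is a shared-memory reduction inside each block, followed by one atomicAdd per block into a single global accumulator. The sketch below is not taken from the question or its answers. It assumes the per-pair energies have already been written to a device array; the names `partial`, `total`, `sum_partials` and `sum_on_device` are hypothetical. It also assumes n > 0, a power-of-two block size, and a GPU of compute capability 2.0 or higher (needed for atomicAdd on float). Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread loads one value, the block reduces its
// values in shared memory, and thread 0 adds the block total to *total.
__global__ void sum_partials(const float *partial, int n, float *total)
{
    extern __shared__ float cache[];          // one float per thread
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the array contribute zero.
    cache[tid] = (idx < n) ? partial[idx] : 0.0f;
    __syncthreads();

    // Tree reduction; assumes blockDim.x is a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    // One atomic update per block keeps contention low.
    if (tid == 0)
        atomicAdd(total, cache[0]);
}

// Hypothetical host wrapper: copies the values in, launches the kernel,
// and copies the single result back. Assumes n > 0.
float sum_on_device(const float *h_partial, int n)
{
    float *d_partial = nullptr;
    float *d_total = nullptr;
    float h_total = 0.0f;

    cudaMalloc(&d_partial, n * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_partial, h_partial, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(float));    // accumulator starts at zero

    const int threads = 256;                  // power of two
    const int blocks = (n + threads - 1) / threads;
    sum_partials<<<blocks, threads, threads * sizeof(float)>>>(d_partial, n, d_total);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_partial);
    cudaFree(d_total);
    return h_total;
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a serial loop, and the order of the atomic updates is not fixed between runs.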

CUDA FFT exception

最后都变了- submitted on 2020-01-26 03:15:10
Question: I'm trying to use the CUDA FFT (cuFFT) library. The problem occurs when cufftPlan1d(..) throws an exception.
    #define NX 256
    #define BATCH 10
    cufftHandle plan;
    cufftComplex *data;
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH);
    if (cudaGetLastError() != cudaSuccess){
        fprintf(stderr, "Cuda error: Failed to allocate\n");
        return;
    }
    if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS){
        fprintf(stderr, "CUFFT error: Plan creation failed");
        return;
    }
When the compiler hits the

Matrix Multiplication giving wrong output [duplicate]

不羁的心 submitted on 2020-01-25 07:13:11
Question: This question already has an answer here: Unable to execute device kernel in CUDA (1 answer). Closed 4 years ago. I am attempting to multiply Matrix A by Matrix B and then, from the product matrix, get the index of the maximum value in each column. Unfortunately, only the first 128*128 values of the matrix multiplication are correct, while the others are just garbage. I do not quite understand how this works. Please guide me with this. #include<stdio.h> #include "cuda

OpenCL : Querying max clock frequency of a mobile GPU always returns a lesser value

牧云@^-^@ submitted on 2020-01-24 15:48:08
Question: To find the max clock frequency of a Mali T760 GPU, I used the code snippet below:
    // Get device max clock frequency
    cl_uint max_clock_freq;
    err_num = clGetDeviceInfo(cl_devices[device_idx], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(max_clock_freq), &max_clock_freq, NULL);
    check_cl_error(err_num, "clGetDeviceInfo: Getting device max clock frequency");
    printf("CL_DEVICE_MAX_CLOCK_FREQUENCY: %d MHz\n", max_clock_freq);
Full source code available here: https://github.com/sivagnanamn/opencl

How to use GPU for mathematics [closed]

落爺英雄遲暮 submitted on 2020-01-20 13:32:25
Question: Closed 2 years ago as needing more focus. I am looking at using the GPU to crunch some equations but cannot figure out how to access it from C#. I know that the XNA and DirectX frameworks let you use shaders to access the GPU, but how would I go about accessing it without these frameworks? Answer 1: I

Unable to execute device kernel in CUDA

时光总嘲笑我的痴心妄想 submitted on 2020-01-11 13:59:13
Question: I am trying to call a device function (__device__) from within a global kernel (__global__). My global kernel is a matrix multiplication, and my device function finds the maximum value and its index in each column of the product matrix. The code follows:
    __device__ void MaxFunction(float* Pd, float* max) {
        int x = (threadIdx.x + blockIdx.x * blockDim.x);
        int y = (threadIdx.y + blockIdx.y * blockDim.y);
        int k = 0;
        int temp = 0;
        int temp_idx = 0;
        for (k = 0; k < wB; ++k) {
            if(Pd[x*wB + y] > temp){
                temp = Pd[x*wB + y];
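As a point of comparison, here is a minimal sketch of a per-column maximum written as a __device__ helper called from a __global__ kernel, with one thread per column. It is not the asker's code or the accepted answer; the names `column_max`, `column_max_kernel` and `column_max_on_device` are hypothetical, and the matrix is assumed to be stored row-major with at least one row. Two details differ from the excerpt above: the running maximum is kept as a float rather than an int, and the loop index is used in the address so each iteration reads a different row. The search runs in its own kernel launch because a column maximum is only meaningful once every block of the multiplication has written its part of the product.

```cuda
#include <cuda_runtime.h>

// Hypothetical device helper: scans one column of a row-major rows x cols
// matrix and records the largest value and the row where it occurs.
__device__ void column_max(const float *P, int rows, int cols, int col,
                           float *max_val, int *max_idx)
{
    float best = P[col];                      // element at row 0
    int best_row = 0;
    for (int r = 1; r < rows; ++r) {
        float v = P[r * cols + col];
        if (v > best) {
            best = v;
            best_row = r;
        }
    }
    *max_val = best;
    *max_idx = best_row;
}

// One thread per column; threads past the last column do nothing.
__global__ void column_max_kernel(const float *P, int rows, int cols,
                                  float *max_val, int *max_idx)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < cols)
        column_max(P, rows, cols, col, &max_val[col], &max_idx[col]);
}

// Hypothetical host wrapper; error checking omitted for brevity.
void column_max_on_device(const float *h_P, int rows, int cols,
                          float *h_max, int *h_idx)
{
    float *d_P = nullptr;
    float *d_max = nullptr;
    int *d_idx = nullptr;
    cudaMalloc(&d_P, rows * cols * sizeof(float));
    cudaMalloc(&d_max, cols * sizeof(float));
    cudaMalloc(&d_idx, cols * sizeof(int));
    cudaMemcpy(d_P, h_P, rows * cols * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (cols + threads - 1) / threads;
    column_max_kernel<<<blocks, threads>>>(d_P, rows, cols, d_max, d_idx);

    // These copies wait for the kernel to finish before reading results.
    cudaMemcpy(h_max, d_max, cols * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_idx, d_idx, cols * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_P);
    cudaFree(d_max);
    cudaFree(d_idx);
}
```

With one thread per column, neighbouring threads read neighbouring addresses in each row, so the global-memory loads are coalesced for a row-major layout.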

printf inside CUDA __global__ function

£可爱£侵袭症+ submitted on 2020-01-10 12:03:39
Question: I am currently writing a matrix multiplication on a GPU and would like to debug my code, but since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function:
    __global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        int bx = blockIdx.x;
        int by = blockIdx.y;
        float sum = 0;
        for( int k = 0; k < Ad.width ; ++k){
            float Melement = Ad.elements[ty * Ad.width
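On GPUs of compute capability 2.0 or higher, printf can be called directly from device code, so the premise of the question holds only for older hardware. The sketch below illustrates this under that assumption; `debug_kernel` and `run_debug_kernel` are hypothetical names, not part of the question. Device-side output is buffered and shows up on the host only at a synchronization point, which is why the wrapper calls cudaDeviceSynchronize before returning.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread prints its indices and the value it reads.
__global__ void debug_kernel(const float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        printf("block %d thread %d: data[%d] = %f\n",
               blockIdx.x, threadIdx.x, i, data[i]);
}

// Hypothetical host wrapper: returns the status of the synchronization,
// which also reports errors raised while the kernel was running.
cudaError_t run_debug_kernel(const float *h_data, int n)
{
    float *d_data = nullptr;
    cudaError_t err = cudaMalloc(&d_data, n * sizeof(float));
    if (err != cudaSuccess)
        return err;
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    debug_kernel<<<(n + threads - 1) / threads, threads>>>(d_data, n);

    // Flushes the device printf buffer to stdout.
    err = cudaDeviceSynchronize();
    cudaFree(d_data);
    return err;
}
```

The order of the printed lines across threads is not guaranteed, so it helps to guard the printf with a condition on the thread or block index when only one element is of interest.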
