gpu-programming

CUDA Matrix multiplication breaks for large matrices

你离开我真会死。 Submitted on 2019-12-19 02:23:11
Question: I have the following matrix-multiplication code, implemented with CUDA 3.2 and VS 2008, running on Windows Server 2008 R2 Enterprise with an NVIDIA GTX 480. The code works fine with values of "Width" (the matrix width) up to about 2500 or so.

    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    cudaError_t err = cudaSuccess;

    // Allocate device memory for M, N and P
    err = cudaMalloc((void**)&Md, size);
    err = cudaMalloc((void**)&Nd, size);
    err = cudaMalloc((void**)&Pd,
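
Since the question body is cut off before the kernel launch, here is a minimal error-checking sketch in the same spirit: checking every runtime call and the launch itself is usually what reveals why a given Width fails (for example an int overflow in the byte count, or the Windows WDDM watchdog killing a long-running kernel). The CUDA_CHECK macro and the Width value below are illustrative additions, not the question's code.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Illustrative helper, not from the question.
    #define CUDA_CHECK(call)                                          \
        do {                                                          \
            cudaError_t e = (call);                                   \
            if (e != cudaSuccess) {                                   \
                fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                        cudaGetErrorString(e), __FILE__, __LINE__);   \
                exit(EXIT_FAILURE);                                   \
            }                                                         \
        } while (0)

    int main()
    {
        int Width = 4096;                                     // a size in the failing range
        size_t size = (size_t)Width * Width * sizeof(float);  // size_t avoids int overflow
        float *Md = NULL;
        CUDA_CHECK(cudaMalloc((void**)&Md, size));
        // ... launch the multiplication kernel here ...
        CUDA_CHECK(cudaGetLastError());        // reports launch-configuration errors
        CUDA_CHECK(cudaDeviceSynchronize());   // reports execution errors, e.g. a
                                               // WDDM display-driver timeout on Windows
        CUDA_CHECK(cudaFree(Md));
        return 0;
    }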

How to interrupt or cancel a CUDA kernel from host code

北城余情 Submitted on 2019-12-19 02:03:21
Question: I am working with CUDA and I am trying to stop my kernel's work (i.e. terminate all running threads) once a certain if block is hit. How can I do that? I am really stuck here.

Answer 1: I assume you want to stop a running kernel (not a single thread). The simplest approach (and the one I suggest) is to set up a global-memory flag that is tested by the kernel. You can set the flag using cudaMemcpy() (or without it if using unified memory). Like the following:

    if (gm_flag) { _
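
The answer's snippet is cut off; below is a hedged sketch of the same flag idea using mapped (zero-copy) pinned memory, so the host can flip the flag while the kernel is still running. All names (longKernel, abort_flag) are illustrative.

    #include <cuda_runtime.h>

    __global__ void longKernel(volatile int *abort_flag)
    {
        for (long i = 0; i < 100000000L; ++i) {
            if (*abort_flag)          // volatile: re-read from memory each time
                return;               // every thread that sees the flag exits
            // ... do one slice of real work here ...
        }
    }

    int main()
    {
        cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped host memory

        int *h_flag = NULL, *d_flag = NULL;
        cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped);
        *h_flag = 0;
        cudaHostGetDevicePointer(&d_flag, h_flag, 0);

        longKernel<<<64, 256>>>(d_flag);
        *h_flag = 1;                  // host requests cancellation mid-flight
        cudaDeviceSynchronize();      // kernel threads notice the flag and return
        cudaFreeHost(h_flag);
        return 0;
    }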

Compiling an OpenCL program using a CL/cl.h file

浪子不回头ぞ Submitted on 2019-12-18 12:52:54
Question: I have sample "Hello, World!" code from the net and I want to run it on the GPU on my university's server. When I type "gcc main.c", it responds with:

    CL/cl.h: No such file or directory

What should I do? How can I get this header file?

Answer 1: Make sure you have the appropriate toolkit installed. This depends on what you intend to run your code on. If you have an NVIDIA card, you need to download and install the CUDA Toolkit, which also contains the necessary binaries and libraries for
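
As a sketch, once a toolkit that ships the headers is installed, you typically have to point gcc at the include and library paths yourself; the /usr/local/cuda paths below are common defaults for NVIDIA's toolkit, not a guarantee for any given server.

    /* Build (paths are typical, adjust for your installation):
     *   gcc main.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lOpenCL -o hello
     */
    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_uint num_platforms = 0;
        clGetPlatformIDs(0, NULL, &num_platforms);   /* minimal sanity check */
        printf("OpenCL platforms found: %u\n", (unsigned)num_platforms);
        return 0;
    }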

CUDA 5.0 dynamic parallelism error: ptxas fatal: unresolved extern function 'cudaLaunchDevice'

跟風遠走 Submitted on 2019-12-17 16:50:32
Question: I am using a Tesla K20 with compute capability 3.5 on Linux with CUDA 5. With a simple child-kernel call it gives a compile error: unresolved extern function 'cudaLaunchDevice'. My command line looks like:

    nvcc --compile -G -O0 -g -gencode arch=compute_35,code=sm_35 -x cu -o fill.cu fill.o

I see cudadevrt.a in lib64. Do we need to add it, or what could be done to resolve this? Without the child-kernel call everything works fine.

Answer 1: You must explicitly compile with relocatable device code enabled and
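
For reference, a hedged sketch of the two-step build the answer points toward: compile with relocatable device code, then link against the device runtime library. File names follow the question and are illustrative.

    # compile with relocatable device code (-rdc=true enables separate compilation)
    nvcc -arch=sm_35 -rdc=true -G -O0 -g -c fill.cu -o fill.o
    # link the final binary against the device runtime
    nvcc -arch=sm_35 fill.o -o fill -lcudadevrt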

Does cudaMallocManaged allocate memory on the device?

泄露秘密 Submitted on 2019-12-13 01:24:19
Question: I'm using Unified Memory to simplify access to data on the CPU and GPU. As far as I know, cudaMallocManaged should allocate memory on the device. I wrote a simple code to check that:

    #define TYPE float
    #define BDIMX 16
    #define BDIMY 16
    #include <cuda.h>
    #include <cstdio>
    #include <iostream>

    __global__ void kernel(TYPE *g_output, TYPE *g_input, const int dimx, const int dimy)
    {
        __shared__ float s_data[BDIMY][BDIMX];
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        int iy = blockIdx.y * blockDim
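
To probe the "where does it live" question directly, one option is to ask the runtime; a minimal sketch (assuming CUDA 10+ for the attr.type field) follows. With cudaMallocManaged the pages migrate on demand, so the physical location depends on which processor touched them last.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        float *p = NULL;
        cudaMallocManaged(&p, 1024 * sizeof(float));
        p[0] = 1.0f;                            // first touch happens on the CPU

        cudaPointerAttributes attr;
        cudaPointerGetAttributes(&attr, p);
        printf("type=%d (managed=%d)\n", (int)attr.type, (int)cudaMemoryTypeManaged);

        cudaFree(p);
        return 0;
    }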

How to select a GPU with CUDA?

浪子不回头ぞ Submitted on 2019-12-12 20:13:58
Question: I have a computer with two GPUs. I wrote a CUDA C program and I need to tell it somehow that I want to run it on just one of the two graphics cards. What is the command I need to type, and how should I use it? I believe it is somehow related to cudaSetDevice, but I can't really figure out how to use it.

Answer 1: It should be pretty clear from the documentation of cudaSetDevice, but let me provide the following code snippet.

    bool IsGpuAvailable()
    {
        int devicesCount;
        cudaGetDeviceCount(&devicesCount);
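
The snippet is cut off, but the overall pattern is short: enumerate the devices, then bind the calling host thread to one of them with cudaSetDevice. A minimal sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("device %d: %s\n", i, prop.name);
        }
        cudaSetDevice(0);   // all subsequent CUDA calls in this thread use device 0
        return 0;
    }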

glFenceSync alternative in OpenGL ES 2.0

懵懂的女人 Submitted on 2019-12-12 18:54:07
Question: I see that glFenceSync does not exist in OpenGL ES 2.0; it was added only in OpenGL ES 3.0. Does OpenGL ES 2.0 offer any alternative for syncing between CPU and GPU, aside from the brute-force glFinish?

Answer 1: In glext.h:

    GL_API GLsync glFenceSyncAPPLE(GLenum condition, GLbitfield flags) __OSX_AVAILABLE_STARTING(__MAC_NA,__IPHONE_6_0);

I am pretty sure this is what you want. It is, however, available only on iOS 6.0 or later.

Answer 2: You have different calls in OpenGL ES 2.0 that give some insight into
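
A hedged sketch of the Apple extension path (GL_APPLE_sync, iOS 6.0+); on other ES 2.0 platforms you would look for a vendor fence extension such as GL_NV_fence, or fall back to glFinish.

    #include <OpenGLES/ES2/gl.h>
    #include <OpenGLES/ES2/glext.h>

    /* Block the CPU until all previously submitted GL commands finish. */
    void waitForGpu(void)
    {
        GLsync sync = glFenceSyncAPPLE(GL_SYNC_GPU_COMMANDS_COMPLETE_APPLE, 0);
        glClientWaitSyncAPPLE(sync, GL_SYNC_FLUSH_COMMANDS_BIT_APPLE,
                              GL_TIMEOUT_IGNORED_APPLE);
        glDeleteSyncAPPLE(sync);
    }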

Are CUDA .ptx files portable?

a 夏天 Submitted on 2019-12-12 11:35:44
Question: I'm studying the cudaDecodeD3D9 sample to learn how CUDA works; at compilation it generates a .ptx file from a .cu file. This .ptx file is, as I understand it so far, an intermediate representation that will be compiled just-in-time for any specific GPU. The sample uses the class cudaModuleMgr to load this file via cuModuleLoadDataEx. The .ptx file is in text format, and I can see that at the top of it is a bunch of hardcoded paths on my machine, including my user folder, i.e.:

    .file 1 "C
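
For what it's worth, the embedded .file entries are only debug metadata; what makes PTX portable is that the driver JIT-compiles it for whatever GPU is present at load time. A minimal driver-API sketch (the kernel.ptx file name and myKernel entry point are hypothetical, and error checking is omitted):

    #include <cuda.h>

    int main()
    {
        cuInit(0);
        CUdevice dev;  cuDeviceGet(&dev, 0);
        CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

        CUmodule mod;
        cuModuleLoad(&mod, "kernel.ptx");   // driver JIT-compiles the PTX
                                            // for the GPU in this context
        CUfunction fn;
        cuModuleGetFunction(&fn, mod, "myKernel");  // hypothetical kernel name
        return 0;
    }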

Creating a copy of the buffer pointed to by host_ptr on the GPU from a GPU kernel in OpenCL

时光怂恿深爱的人放手 Submitted on 2019-12-12 07:03:18
Question: I was trying to understand how exactly CL_MEM_USE_HOST_PTR and CL_MEM_COPY_HOST_PTR work. Basically, when using CL_MEM_USE_HOST_PTR, say in creating a 2D image, nothing is copied to the device; instead the GPU refers to the mapped memory on the host (clEnqueueMapBuffer maps it), does the processing, and we can write the results to some other location. On the other hand, if I use CL_MEM_COPY_HOST_PTR, it will create a copy of the data pointed to by the host pointer on the device (I guess it will
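
A minimal sketch of the contrast (ctx, host_data, and bytes are assumed to come from the caller):

    #include <CL/cl.h>

    void make_buffers(cl_context ctx, void *host_data, size_t bytes)
    {
        cl_int err;
        /* No copy at creation; the runtime may cache the data on the device,
           but host_data must remain valid for the buffer's lifetime. */
        cl_mem use_buf = clCreateBuffer(ctx,
            CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, bytes, host_data, &err);

        /* Snapshot copied into device-owned storage at creation;
           host_data may be freed afterwards. */
        cl_mem copy_buf = clCreateBuffer(ctx,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, host_data, &err);

        clReleaseMemObject(use_buf);
        clReleaseMemObject(copy_buf);
    }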

CUDA kernel with reduction - logic errors for dot product of 2 matrices

二次信任 Submitted on 2019-12-12 04:49:57
Question: I am just starting off with CUDA and am trying to wrap my brain around the CUDA reduction algorithm. In my case, I have been trying to compute the dot product of two matrices, but I get the right answer only for matrices of size 2; for any other size I get it wrong. This is only a test, so I am keeping the matrix size very small (only about 100, so a single block fits it all). Any help would be greatly appreciated. Thanks! Here is the regular code:

    float* ha = new float[n]; //
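
The question's kernel is cut off, but for reference this is the classic single-block shared-memory tree reduction for a dot product; the usual reasons such code works only for tiny sizes are a missing __syncthreads() inside the loop or a stride loop that assumes a power-of-two block size. A hedged sketch (assumes n <= blockDim.x and a power-of-two block size):

    #include <cuda_runtime.h>

    #define BLOCK 128

    // Launch as: dotKernel<<<1, BLOCK>>>(da, db, dout, n);
    __global__ void dotKernel(const float *a, const float *b, float *out, int n)
    {
        __shared__ float cache[BLOCK];
        int tid = threadIdx.x;
        cache[tid] = (tid < n) ? a[tid] * b[tid] : 0.0f;  // pad unused slots with 0
        __syncthreads();

        // Tree reduction: halve the number of active threads each step.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                cache[tid] += cache[tid + stride];
            __syncthreads();   // every step must be synchronized, even the last
        }
        if (tid == 0)
            *out = cache[0];
    }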