gpu-programming

CUDA Matrix multiplication breaks for large matrices

Submitted by 一世执手 on 2019-11-30 19:50:09
I have the following matrix multiplication code, implemented with CUDA 3.2 and VS 2008, running on Windows Server 2008 R2 Enterprise with an NVIDIA GTX 480. The code works fine for values of "Width" (the matrix width) up to about 2500 or so:

int size = Width * Width * sizeof(float);
float *Md, *Nd, *Pd;
cudaError_t err = cudaSuccess;

// Allocate device memory for M, N and P
err = cudaMalloc((void**)&Md, size);
err = cudaMalloc((void**)&Nd, size);
err = cudaMalloc((void**)&Pd, size);

// Copy matrices from host memory to device memory
err = cudaMemcpy(Md, M, size,

QR decomposition to solve linear systems in CUDA

Submitted by 跟風遠走 on 2019-11-30 16:27:05
I'm writing an image restoration algorithm on GPU, details in Cuda: least square solving, poor in speed. The QR decomposition method for solving the linear system Ax = b works as follows:

min ||Ax - b|| -> ||QRx - b|| -> ||(Q^T)QRx - (Q^T)b|| -> ||Rx - (Q^T)b||

where R is the upper triangular factor. The resulting upper triangular linear system is easy to solve. I want to use the CULA tools to implement this method. The CULA routine GEQRF computes a QR factorization. The manual says: On exit, the elements on and above the diagonal of the array contain the min(M,N)-by-N upper trapezoidal matrix R (R is

Compiling an OpenCL program using a CL/cl.h file

Submitted by 青春壹個敷衍的年華 on 2019-11-30 08:14:33
I have sample "Hello, World!" code from the net and I want to run it on the GPU of my university's server. When I compile with "gcc main.c", it responds with: CL/cl.h: No such file or directory. What should I do? Where can I get this header file? Make sure you have the appropriate toolkit installed; this depends on what you intend to run your code on. If you have an NVIDIA card, you need to download and install the CUDA toolkit, which also contains the necessary binaries and libraries for OpenCL. Are you running Linux? If you believe you already have OpenCL installed, it could be that it is found at
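The error means the compiler cannot find the OpenCL headers; once a toolkit is installed, you point gcc at them with -I and link the OpenCL library with -l. A sketch of the usual include pattern (the include path in the comment is an assumption and varies by toolkit and distribution):

```c
/* main.c: the include path differs on macOS vs everywhere else. */
#ifdef __APPLE__
#include <OpenCL/cl.h>   /* Apple ships OpenCL as a framework */
#else
#include <CL/cl.h>       /* NVIDIA/AMD/Intel toolkits use CL/ */
#endif

/* Typical compile line (paths are assumptions; adjust to your install):
 *   gcc main.c -I/usr/local/cuda/include -lOpenCL -o hello
 */
```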

How to use GPU for mathematics [closed]

Submitted by 荒凉一梦 on 2019-11-29 20:54:11
I am looking at utilising the GPU for crunching some equations but cannot figure out how I can access it from C#. I know that the XNA and DirectX frameworks allow you to use shaders in order to access the GPU, but how would I go about accessing it without these frameworks? I haven't done it from C#, but basically you use the CUDA SDK and toolkit (assuming you're using an NVIDIA card here, of course) to pull it off. NVIDIA has ported (or written?) a BLAS implementation for use on CUDA-capable devices. They've provided plenty of examples for how to do number crunching, although you'll have

Differences between VexCL, Thrust, and Boost.Compute

Submitted by 感情迁移 on 2019-11-29 19:52:00
With just a cursory understanding of these libraries, they look to be very similar. I know that VexCL and Boost.Compute use OpenCL as a backend (although since the v1.0 release VexCL also supports CUDA as a backend) and Thrust uses CUDA. Aside from the different backends, what's the difference between them? Specifically, what problem space do they address, and why would I want to use one over the other? Also, on the Thrust FAQ it is stated that: The primary barrier to OpenCL support is the lack of an OpenCL compiler and runtime with support for C++ templates. If this is the case, how is it possible

How do you include standard CUDA libraries to link with NVRTC code?

Submitted by ▼魔方 西西 on 2019-11-28 13:46:44
Specifically, my issue is that I have CUDA code that needs <curand_kernel.h> to run. This isn't included by default in NVRTC. Presumably then when creating the program context (i.e. the call to nvrtcCreateProgram ), I have to send in the name of the file ( curand_kernel.h ) and also the source code of curand_kernel.h ? I feel like I shouldn't have to do that. It's hard to tell; I haven't managed to find an example from NVIDIA of someone needing standard CUDA files like this as a source, so I really don't understand what the syntax is. Some issues: curand_kernel.h also has includes... Do I have

How can I use the cooperative groups feature of CUDA on Windows

Submitted by 感情迁移 on 2019-11-28 10:49:25
Question: My GPU is a GeForce MX150 (Pascal architecture, compute capability 6.1), with CUDA 9.1 on Windows 10. Although my GPU is Pascal, cooperative groups don't work. I want to use them for inter-block synchronization. I found that TCC mode isn't active on my card, and that it isn't available under WDDM on Windows. How can I use cooperative groups? How can I activate TCC mode on Windows? Thanks for your reply. Answer 1: You can't activate TCC on that GPU (it is not supported), and there is no way to use a cooperative launch under

CUDA Thrust: reduce_by_key on only some values in an array, based on values in a “key” array

Submitted by 岁酱吖の on 2019-11-28 08:50:22
Let's say I have two device_vector<byte> arrays, d_keys and d_data . If d_data is, for example, a flattened 2D 3x5 array ( e.g. { 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3 } ) and d_keys is a 1D array of size 5 ( e.g. { 1, 0, 0, 1, 1 } ), how can I do a reduction such that I'd end up only adding values on a per-row basis if the corresponding d_keys value is one ( e.g. ending up with a result of { 10, 23, 14 } )? The sum_rows.cu example allows me to add every value in d_data , but that's not quite right. Alternatively, I can, on a per-row basis, use a zip_iterator and combine d_keys with one