cuda

Smart design for large kernel with different inputs that only changes one line of code

﹥>﹥吖頭↗ submitted on 2021-02-11 17:10:17
Question: I am designing some kernels that I would like to be able to call in two ways: once with a standard float * device pointer as input (for writing), and once with a cudaSurfaceObject_t as input (for writing). The kernel itself is long (>200 lines) and ultimately only the last line needs to differ: in one case it is a standard out[idx] = val assignment, in the other a surf3Dwrite() call. The rest of the kernel is identical. Something like __global__ kernel(float * out , ....) { // 200
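A minimal sketch of one common way to share the long body, assuming nothing about the poster's actual kernel: overload a small __device__ store() helper for each output type and template the kernel on that type, so the ~200 shared lines are written only once. The names (store, kernel, the index math) are illustrative, not the original code.

    #include <cuda_runtime.h>

    __device__ __forceinline__
    void store(float *out, size_t idx, int, int, int, float val)
    {
        out[idx] = val;                                       // plain linear-memory write
    }

    __device__ __forceinline__
    void store(cudaSurfaceObject_t out, size_t, int x, int y, int z, float val)
    {
        surf3Dwrite(val, out, x * (int)sizeof(float), y, z);  // surface write (x in bytes)
    }

    template <typename Output>
    __global__ void kernel(Output out, int nx, int ny, int nz)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z * blockDim.z + threadIdx.z;
        if (x >= nx || y >= ny || z >= nz) return;

        // ... the ~200 lines shared by both variants compute `val` here ...
        float val = 0.0f;

        size_t idx = ((size_t)z * ny + y) * nx + x;
        store(out, idx, x, y, z, val);                        // overload resolved at compile time
    }

Both variants are then instantiated from the same source, e.g. kernel<<<grid, block>>>(d_ptr, nx, ny, nz) and kernel<<<grid, block>>>(surfObj, nx, ny, nz).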

How to transpose a sparse matrix in cuSparse?

会有一股神秘感。 submitted on 2021-02-11 14:49:08
Question: I am trying to compute A^T A using cuSparse, where A is a large but sparse matrix. Based on the documentation, the proper function to use is cusparseDcsrgemm2. However, this is one of the few cuSparse operations that does not support an optional built-in transpose of an input matrix; the documentation states: "Only the NN version is supported. For other modes, the user has to transpose A or B explicitly." The problem is that I couldn't find a function in cuSparse that can perform a
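A hedged sketch of the usual workaround (assuming CSR storage and the cusparseCsr2cscEx2() routine available since CUDA 10.1): converting A from CSR to CSC produces arrays that are, by definition, the CSR representation of A^T, which can then be fed to the NN-only routine. The function and enum names below exist in cuSparse; the variable names and the choice of double precision are my assumptions.

    #include <cuda_runtime.h>
    #include <cusparse.h>

    // Converts an m-by-n CSR matrix A into CSC storage; the resulting CSC
    // arrays are exactly the CSR representation of the n-by-m matrix A^T.
    void transpose_csr(cusparseHandle_t handle, int m, int n, int nnz,
                       const double *d_csrVal, const int *d_csrRowPtr, const int *d_csrColInd,
                       double *d_cscVal, int *d_cscColPtr, int *d_cscRowInd)
    {
        size_t bufferSize = 0;
        cusparseCsr2cscEx2_bufferSize(handle, m, n, nnz,
                                      d_csrVal, d_csrRowPtr, d_csrColInd,
                                      d_cscVal, d_cscColPtr, d_cscRowInd,
                                      CUDA_R_64F, CUSPARSE_ACTION_NUMERIC,
                                      CUSPARSE_INDEX_BASE_ZERO,
                                      CUSPARSE_CSR2CSC_ALG1, &bufferSize);

        void *d_buffer = nullptr;
        cudaMalloc(&d_buffer, bufferSize);

        cusparseCsr2cscEx2(handle, m, n, nnz,
                           d_csrVal, d_csrRowPtr, d_csrColInd,
                           d_cscVal, d_cscColPtr, d_cscRowInd,
                           CUDA_R_64F, CUSPARSE_ACTION_NUMERIC,
                           CUSPARSE_INDEX_BASE_ZERO,
                           CUSPARSE_CSR2CSC_ALG1, d_buffer);

        cudaFree(d_buffer);
    }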

thrust functor: “too many resources requested for launch”

我怕爱的太早我们不能终老 submitted on 2021-02-11 11:55:07
Question: I'm trying to implement something like this in CUDA: for each element, p stays p if p >= floor, and becomes z if p < floor, where floor and z are constants configured at the start of the test. I have attempted to implement it like so, but I get the error "too many resources requested for launch". A functor: struct floor_functor : thrust::unary_function<float, float> { const float floorLevel, floorVal; floor_functor(float _floorLevel, float _floorVal) : floorLevel(_floorLevel), floorVal(_floorVal){} __host__ _
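For reference, a self-contained reconstruction of the truncated functor together with a thrust::transform call that uses it; the operator() body and the call site are my completion of the cut-off code, not the poster's exact source. (In general, "too many resources requested for launch" means the launch configuration exceeds a per-block hardware limit, most often registers per block.)

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>

    struct floor_functor : public thrust::unary_function<float, float>
    {
        const float floorLevel, floorVal;

        floor_functor(float _floorLevel, float _floorVal)
            : floorLevel(_floorLevel), floorVal(_floorVal) {}

        __host__ __device__
        float operator()(float p) const
        {
            // keep p when it is at or above the floor, otherwise substitute floorVal
            return (p >= floorLevel) ? p : floorVal;
        }
    };

    int main()
    {
        thrust::device_vector<float> data(1 << 20, 0.5f);

        // in-place transform: data[i] = floor_functor(data[i])
        thrust::transform(data.begin(), data.end(), data.begin(),
                          floor_functor(1.0f, -1.0f));
        return 0;
    }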

Correct way to use __constant__ memory on CUDA?

邮差的信 submitted on 2021-02-11 10:25:08
Question: I have an array I would like to initialize in __constant__ memory on the CUDA device. I don't know its size or its values until runtime. I know I can use __constant__ float Points[N][2] or something like that, but how do I make this dynamic? Maybe in the form of __constant__ float* Points? Is this possible? And, possibly more important, is this a good idea? If there are better alternatives I would love to hear them. Answer 1: As it has been discussed in Dynamic
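A minimal sketch of the common pattern (not necessarily where the truncated answer above is heading): a __constant__ array must have a compile-time size, so declare an upper bound and copy only the runtime values into it with cudaMemcpyToSymbol(). MAX_POINTS and the kernel below are illustrative assumptions.

    #include <cuda_runtime.h>

    #define MAX_POINTS 1024                       // assumed upper bound, fixed at compile time

    __constant__ float Points[MAX_POINTS][2];

    __global__ void usePoints(int n, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = Points[i][0] + Points[i][1];
    }

    int main()
    {
        int n = 300;                              // actual size only known at runtime
        float h_points[MAX_POINTS][2];
        for (int i = 0; i < n; ++i) { h_points[i][0] = (float)i; h_points[i][1] = 2.0f * i; }

        // copy only the n entries that are actually used into constant memory
        cudaMemcpyToSymbol(Points, h_points, n * 2 * sizeof(float));

        float *d_out;
        cudaMalloc(&d_out, n * sizeof(float));
        usePoints<<<(n + 255) / 256, 256>>>(n, d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }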

How to execute atomic write in CUDA?

旧街凉风 submitted on 2021-02-11 09:26:40
Question: First of all, I cannot find a reliable source on whether a plain write is atomic in CUDA or not. For example, "Is global memory write considered atomic in CUDA?" touches on this subject, but its last remark shows we are not talking about the same notion of atomicity. Given the code: global_mem[0] = pick_at_random_from(1, 2); shared_mem[0] = pick_at_random_from(1, 2); executed by a gazillion threads, "atomic" here means that in both cases the final content will be 1 or 2, and it is guaranteed nothing else (like 3) can show up. Atomic
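A hedged illustration of one way to get a guaranteed-atomic store in the sense described above: perform the write with atomicExch(), whose read-modify-write is atomic by definition, instead of a plain assignment. The pick_at_random_from() helper from the question is stubbed here; everything else is illustrative.

    #include <cuda_runtime.h>

    __device__ __forceinline__ int pick_at_random_from(int a, int b, unsigned seed)
    {
        // trivial stand-in for the question's pick_at_random_from(1, 2)
        return (seed & 1u) ? a : b;
    }

    __global__ void writers(int *global_mem)
    {
        __shared__ int shared_mem[1];
        unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;

        int v = pick_at_random_from(1, 2, tid);

        // atomicExch performs each store as a single atomic operation, so the
        // word can only ever hold a value some thread actually wrote (1 or 2).
        atomicExch(&global_mem[0], v);
        atomicExch(&shared_mem[0], v);
    }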

CUDA - CUBLAS: issues solving many (3x3) dense linear systems

元气小坏坏 submitted on 2021-02-10 15:45:08
Question: I am trying to solve about 1,200,000 linear systems (3x3, Ax = B) with CUDA 10.1, in particular using the cuBLAS library. I took a cue from this (useful!) post and rewrote the suggested code in a Unified Memory version. The algorithm first performs an LU factorization using cublas<t>getrfBatched(), followed by two consecutive invocations of cublas<t>trsm(), which solve the lower and upper triangular systems. The code is attached below. It works correctly up to about 10,000 matrices and, in this case,
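Since the attached code is not included in the excerpt, here is a reduced Unified Memory sketch of the batched pattern the question describes, using cublasDgetrfBatched() plus cublasDgetrsBatched() for the solve step instead of the two triangular solves; the batch size, the fill loop, and all variable names are my assumptions, not the poster's code.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 3, batch = 10000;
        cublasHandle_t handle;
        cublasCreate(&handle);

        double *A, *B;                 // batch of 3x3 matrices and 3x1 right-hand sides
        double **Aarray, **Barray;     // per-system pointers required by the batched API
        int *pivots, *infoLU, infoSolve;

        cudaMallocManaged(&A, sizeof(double) * n * n * batch);
        cudaMallocManaged(&B, sizeof(double) * n * batch);
        cudaMallocManaged(&Aarray, sizeof(double*) * batch);
        cudaMallocManaged(&Barray, sizeof(double*) * batch);
        cudaMallocManaged(&pivots, sizeof(int) * n * batch);
        cudaMallocManaged(&infoLU, sizeof(int) * batch);

        for (int i = 0; i < batch; ++i) {
            Aarray[i] = A + i * n * n;
            Barray[i] = B + i * n;
            // ... fill Aarray[i] (column-major) and Barray[i] with the i-th system ...
        }

        // LU factorization of every matrix in the batch
        cublasDgetrfBatched(handle, n, Aarray, n, pivots, infoLU, batch);

        // solve A x = B for every system using the factors and pivots
        cublasDgetrsBatched(handle, CUBLAS_OP_N, n, 1,
                            (const double * const *)Aarray, n, pivots,
                            Barray, n, &infoSolve, batch);

        cudaDeviceSynchronize();       // results are now in B (Unified Memory)

        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(Aarray); cudaFree(Barray);
        cudaFree(pivots); cudaFree(infoLU);
        return 0;
    }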
