cuda

Smart design for large kernel with different inputs that only changes one line of code

﹥>﹥吖頭↗ submitted on 2021-02-11 17:10:17
Question: I am designing some kernels that I would like to be able to call in two ways: once with a standard float * device pointer as input (for writing), and once with a cudaSurfaceObject_t as input (for writing). The kernel itself is long (>200 lines) and ultimately only the last line needs to differ: in one case it is a standard out[idx] = val assignment, in the other a surf3Dwrite() call. The rest of the kernel is identical. Something like __global__ kernel(float * out , ....) { // 200
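A minimal sketch of one common way to share the long body, assuming nothing about the poster's actual kernel: overload a small __device__ store() helper for each output type and template the kernel on that type, so the ~200 shared lines are written only once. The names (store, kernel, the index math) are illustrative, not the original code.

    #include <cuda_runtime.h>

    __device__ __forceinline__
    void store(float *out, size_t idx, int, int, int, float val)
    {
        out[idx] = val;                                       // plain linear-memory write
    }

    __device__ __forceinline__
    void store(cudaSurfaceObject_t out, size_t, int x, int y, int z, float val)
    {
        surf3Dwrite(val, out, x * (int)sizeof(float), y, z);  // surface write (x in bytes)
    }

    template <typename Output>
    __global__ void kernel(Output out, int nx, int ny, int nz)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z * blockDim.z + threadIdx.z;
        if (x >= nx || y >= ny || z >= nz) return;

        // ... the ~200 lines shared by both variants compute `val` here ...
        float val = 0.0f;

        size_t idx = ((size_t)z * ny + y) * nx + x;
        store(out, idx, x, y, z, val);                        // overload resolved at compile time
    }

Both variants are then instantiated from the same source, e.g. kernel<<<grid, block>>>(d_ptr, nx, ny, nz) and kernel<<<grid, block>>>(surfObj, nx, ny, nz).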

How to transpose a sparse matrix in cuSparse?

会有一股神秘感。 submitted on 2021-02-11 14:49:08
Question: I am trying to compute A^T A using cuSparse, where A is a large but sparse matrix. Based on the documentation, the proper function to use is cusparseDcsrgemm2. However, this is one of the few cuSparse operations that does not support an optional built-in transpose of an input matrix; the documentation states: "Only the NN version is supported. For other modes, the user has to transpose A or B explicitly." The problem is that I couldn't find a function in cuSparse that can perform a
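A hedged sketch of the usual workaround (assuming CSR storage and the cusparseCsr2cscEx2() routine available since CUDA 10.1): converting A from CSR to CSC produces arrays that are, by definition, the CSR representation of A^T, which can then be fed to the NN-only routine. The function and enum names below exist in cuSparse; the variable names and the choice of double precision are my assumptions.

    #include <cuda_runtime.h>
    #include <cusparse.h>

    // Converts an m-by-n CSR matrix A into CSC storage; the resulting CSC
    // arrays are exactly the CSR representation of the n-by-m matrix A^T.
    void transpose_csr(cusparseHandle_t handle, int m, int n, int nnz,
                       const double *d_csrVal, const int *d_csrRowPtr, const int *d_csrColInd,
                       double *d_cscVal, int *d_cscColPtr, int *d_cscRowInd)
    {
        size_t bufferSize = 0;
        cusparseCsr2cscEx2_bufferSize(handle, m, n, nnz,
                                      d_csrVal, d_csrRowPtr, d_csrColInd,
                                      d_cscVal, d_cscColPtr, d_cscRowInd,
                                      CUDA_R_64F, CUSPARSE_ACTION_NUMERIC,
                                      CUSPARSE_INDEX_BASE_ZERO,
                                      CUSPARSE_CSR2CSC_ALG1, &bufferSize);

        void *d_buffer = nullptr;
        cudaMalloc(&d_buffer, bufferSize);

        cusparseCsr2cscEx2(handle, m, n, nnz,
                           d_csrVal, d_csrRowPtr, d_csrColInd,
                           d_cscVal, d_cscColPtr, d_cscRowInd,
                           CUDA_R_64F, CUSPARSE_ACTION_NUMERIC,
                           CUSPARSE_INDEX_BASE_ZERO,
                           CUSPARSE_CSR2CSC_ALG1, d_buffer);

        cudaFree(d_buffer);
    }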

thrust functor: “too many resources requested for launch”

我怕爱的太早我们不能终老 submitted on 2021-02-11 11:55:07
Question: I'm trying to implement something like this in CUDA: for each element, p stays p if p >= floor, and becomes z if p < floor, where floor and z are constants configured at the start of the test. I have attempted to implement it like so, but I get the error "too many resources requested for launch". A functor: struct floor_functor : thrust::unary_function<float, float> { const float floorLevel, floorVal; floor_functor(float _floorLevel, float _floorVal) : floorLevel(_floorLevel), floorVal(_floorVal){} __host__ _
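For reference, a self-contained reconstruction of the truncated functor together with a thrust::transform call that uses it; the operator() body and the call site are my completion of the cut-off code, not the poster's exact source. (In general, "too many resources requested for launch" means the launch configuration exceeds a per-block hardware limit, most often registers per block.)

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>

    struct floor_functor : public thrust::unary_function<float, float>
    {
        const float floorLevel, floorVal;

        floor_functor(float _floorLevel, float _floorVal)
            : floorLevel(_floorLevel), floorVal(_floorVal) {}

        __host__ __device__
        float operator()(float p) const
        {
            // keep p when it is at or above the floor, otherwise substitute floorVal
            return (p >= floorLevel) ? p : floorVal;
        }
    };

    int main()
    {
        thrust::device_vector<float> data(1 << 20, 0.5f);

        // in-place transform: data[i] = floor_functor(data[i])
        thrust::transform(data.begin(), data.end(), data.begin(),
                          floor_functor(1.0f, -1.0f));
        return 0;
    }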

Correct way to use __constant__ memory on CUDA?

邮差的信 submitted on 2021-02-11 10:25:08
Question: I have an array I would like to initialize in __constant__ memory on the CUDA device. I don't know its size or its values until runtime. I know I can use __constant__ float Points[N][2] or something like that, but how do I make this dynamic? Maybe in the form of __constant__ float* Points? Is this possible? And, possibly more important, is this a good idea? If there are better alternatives I would love to hear them. Answer 1: As it has been discussed in Dynamic
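A minimal sketch of the common pattern (not necessarily where the truncated answer above is heading): a __constant__ array must have a compile-time size, so declare an upper bound and copy only the runtime values into it with cudaMemcpyToSymbol(). MAX_POINTS and the kernel below are illustrative assumptions.

    #include <cuda_runtime.h>

    #define MAX_POINTS 1024                       // assumed upper bound, fixed at compile time

    __constant__ float Points[MAX_POINTS][2];

    __global__ void usePoints(int n, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = Points[i][0] + Points[i][1];
    }

    int main()
    {
        int n = 300;                              // actual size only known at runtime
        float h_points[MAX_POINTS][2];
        for (int i = 0; i < n; ++i) { h_points[i][0] = (float)i; h_points[i][1] = 2.0f * i; }

        // copy only the n entries that are actually used into constant memory
        cudaMemcpyToSymbol(Points, h_points, n * 2 * sizeof(float));

        float *d_out;
        cudaMalloc(&d_out, n * sizeof(float));
        usePoints<<<(n + 255) / 256, 256>>>(n, d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }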

How to execute atomic write in CUDA?

旧街凉风 submitted on 2021-02-11 09:26:40
Question: First of all, I cannot find a reliable source on whether a plain write is atomic in CUDA or not. For example, "Is global memory write considered atomic in CUDA?" touches on this subject, but its last remark shows we are not talking about the same notion of atomicity. Given the code: global_mem[0] = pick_at_random_from(1, 2); shared_mem[0] = pick_at_random_from(1, 2); executed by a gazillion threads, "atomic" here means that in both cases the final content will be 1 or 2, and it is guaranteed nothing else (like 3) can show up. Atomic
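A hedged illustration of one way to get a guaranteed-atomic store in the sense described above: perform the write with atomicExch(), whose read-modify-write is atomic by definition, instead of a plain assignment. The pick_at_random_from() helper from the question is stubbed here; everything else is illustrative.

    #include <cuda_runtime.h>

    __device__ __forceinline__ int pick_at_random_from(int a, int b, unsigned seed)
    {
        // trivial stand-in for the question's pick_at_random_from(1, 2)
        return (seed & 1u) ? a : b;
    }

    __global__ void writers(int *global_mem)
    {
        __shared__ int shared_mem[1];
        unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;

        int v = pick_at_random_from(1, 2, tid);

        // atomicExch performs each store as a single atomic operation, so the
        // word can only ever hold a value some thread actually wrote (1 or 2).
        atomicExch(&global_mem[0], v);
        atomicExch(&shared_mem[0], v);
    }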

CUDA - CUBLAS: issues solving many (3x3) dense linear systems

元气小坏坏 submitted on 2021-02-10 15:45:08
Question: I am trying to solve about 1,200,000 linear systems (3x3, Ax = B) with CUDA 10.1, in particular using the cuBLAS library. I took a cue from this (useful!) post and rewrote the suggested code in a Unified Memory version. The algorithm first performs an LU factorization using cublas<t>getrfBatched(), followed by two consecutive invocations of cublas<t>trsm(), which solve the lower and upper triangular systems. The code is attached below. It works correctly up to about 10,000 matrices and, in this case,
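Since the attached code is not included in the excerpt, here is a reduced Unified Memory sketch of the batched pattern the question describes, using cublasDgetrfBatched() plus cublasDgetrsBatched() for the solve step instead of the two triangular solves; the batch size, the fill loop, and all variable names are my assumptions, not the poster's code.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 3, batch = 10000;
        cublasHandle_t handle;
        cublasCreate(&handle);

        double *A, *B;                 // batch of 3x3 matrices and 3x1 right-hand sides
        double **Aarray, **Barray;     // per-system pointers required by the batched API
        int *pivots, *infoLU, infoSolve;

        cudaMallocManaged(&A, sizeof(double) * n * n * batch);
        cudaMallocManaged(&B, sizeof(double) * n * batch);
        cudaMallocManaged(&Aarray, sizeof(double*) * batch);
        cudaMallocManaged(&Barray, sizeof(double*) * batch);
        cudaMallocManaged(&pivots, sizeof(int) * n * batch);
        cudaMallocManaged(&infoLU, sizeof(int) * batch);

        for (int i = 0; i < batch; ++i) {
            Aarray[i] = A + i * n * n;
            Barray[i] = B + i * n;
            // ... fill Aarray[i] (column-major) and Barray[i] with the i-th system ...
        }

        // LU factorization of every matrix in the batch
        cublasDgetrfBatched(handle, n, Aarray, n, pivots, infoLU, batch);

        // solve A x = B for every system using the factors and pivots
        cublasDgetrsBatched(handle, CUBLAS_OP_N, n, 1,
                            (const double * const *)Aarray, n, pivots,
                            Barray, n, &infoSolve, batch);

        cudaDeviceSynchronize();       // results are now in B (Unified Memory)

        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(Aarray); cudaFree(Barray);
        cudaFree(pivots); cudaFree(infoLU);
        return 0;
    }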
