CUDA


Polymorphism and derived classes in CUDA / CUDA Thrust

心不动则不痛 submitted on 2020-01-27 07:56:13
Question: This is my first question on Stack Overflow, and it's quite a long question. The tl;dr version is: how do I work with a thrust::device_vector<BaseClass> if I want it to store objects of different types DerivedClass1, DerivedClass2, etc., simultaneously? I want to take advantage of polymorphism with CUDA Thrust. I'm compiling for an -arch=sm_30 GPU (GeForce GTX 670). Let us take a look at the following problem: suppose there are 80 families in town. 60 of them are married couples, 20 of them
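
A minimal sketch of one common workaround for this situation (hypothetical names; not necessarily the approach taken in the full answer): virtual dispatch breaks once objects are copied to the device through a device_vector, because their vtable pointers still point into host memory, so a flat struct with a type tag plus a branch inside a device functor is often used instead.

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <cstdio>

// Hypothetical flat replacement for BaseClass / DerivedClass1 / DerivedClass2:
// a plain struct plus a tag, so no vtable pointer ever crosses the host/device boundary.
enum FamilyKind { COUPLE = 0, OTHER = 1 };

struct Family {
    int   kind;    // FamilyKind tag
    float income;  // example payload
};

// Device-side "virtual call": branch on the tag instead of dispatching through a vtable.
struct TaxFunctor {
    __host__ __device__ float operator()(const Family& f) const {
        return (f.kind == COUPLE) ? 0.10f * f.income   // hypothetical rule per kind
                                  : 0.05f * f.income;
    }
};

int main() {
    thrust::device_vector<Family> families(80);   // e.g. 60 couples + 20 other families
    // ... fill the vector from the host ...
    thrust::device_vector<float> tax(families.size());
    thrust::transform(families.begin(), families.end(), tax.begin(), TaxFunctor());
    printf("tax of family 0: %f\n", (float)tax[0]);
    return 0;
}

Keeping one device_vector per concrete type (one for couples, one for the rest) is the other common layout; both avoid storing objects with virtual functions on the device.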

multi-precision multiplication in CUDA

一笑奈何 submitted on 2020-01-26 04:27:09
Question: I am trying to implement multi-precision multiplication in CUDA. To do that, I have implemented a kernel which should compute the multiplication of a uint32_t operand with a 256-bit operand and put the result in a 288-bit array. So far, I have come up with this code: __device__ __constant__ UN_256fe B_const; __global__ void multiply32x256Kernel(uint32_t A, UN_288bite* result){ uint8_t tid = blockIdx.x * blockDim.x + threadIdx.x; //for managing warps //uint8_t laneid = tid % 32; //allocate
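
Since the snippet above is cut off and the UN_256fe / UN_288bite definitions are not shown, here is a hedged, serial reference sketch of the same 32 x 256-bit product, assuming little-endian 32-bit limbs (8 limbs in, 9 limbs out) and a 64-bit intermediate product for carry propagation:

#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Assumed layouts: limb[0] is the least significant 32 bits.
struct UN_256fe   { uint32_t limb[8]; };   // 256-bit operand
struct UN_288bite { uint32_t limb[9]; };   // 288-bit result

__device__ __constant__ UN_256fe B_const;

// Single-thread reference version: multiply the 32-bit scalar A by the 256-bit
// constant, propagating the carry limb by limb through a 64-bit product.
__global__ void multiply32x256Kernel(uint32_t A, UN_288bite* result)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        uint64_t carry = 0;
        for (int i = 0; i < 8; ++i) {
            uint64_t p = (uint64_t)A * B_const.limb[i] + carry;
            result->limb[i] = (uint32_t)p;   // low 32 bits of the partial product
            carry = p >> 32;                 // high 32 bits feed the next limb
        }
        result->limb[8] = (uint32_t)carry;   // final carry becomes the top limb
    }
}

On the host, cudaMemcpyToSymbol(B_const, &h_B, sizeof(UN_256fe)) loads the constant operand before launching the kernel; a per-limb parallel version would instead give each of 8 threads one partial product (e.g. via __umulhi) and resolve the carries afterwards.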

Rank of each element in a matrix row using CUDA

本秂侑毒 submitted on 2020-01-26 04:17:05
Question: Is there any way to find the rank of each element in a matrix row separately using CUDA, or any function for this provided by NVIDIA? Answer 1: I don't know of a built-in ranking or argsort function in CUDA or in any of the libraries I am familiar with. You could certainly build such a function out of lower-level operations, using thrust for example. Here is a (non-optimized) outline of a possible solution approach using thrust: $ cat t84.cu #include <thrust/device_vector.h> #include <thrust/copy.h>
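
The thrust outline in the answer is truncated here, so below is a hedged, brute-force sketch of the same idea (hypothetical kernel name): each thread takes one element and counts how many elements in its row are strictly smaller, which is that element's rank. This is O(cols) work per element, not the optimized argsort-based approach.

#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: rank = number of strictly smaller elements in the same row.
// Data is row-major with dimensions rows x cols; ties receive equal ranks.
__global__ void rowRankKernel(const float* data, int* rank, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= rows || col >= cols) return;

    float v = data[row * cols + col];
    int r = 0;
    for (int j = 0; j < cols; ++j)
        if (data[row * cols + j] < v) ++r;
    rank[row * cols + col] = r;
}

int main()
{
    const int rows = 2, cols = 4;
    float h_data[rows * cols] = { 3, 1, 4, 1,   9, 2, 6, 5 };
    int   h_rank[rows * cols];

    float* d_data; int* d_rank;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMalloc(&d_rank, sizeof(h_rank));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    rowRankKernel<<<grid, block>>>(d_data, d_rank, rows, cols);
    cudaMemcpy(h_rank, d_rank, sizeof(h_rank), cudaMemcpyDeviceToHost);

    for (int i = 0; i < rows * cols; ++i) printf("%d ", h_rank[i]);  // 2 0 3 0 3 0 2 1
    printf("\n");
    cudaFree(d_data); cudaFree(d_rank);
    return 0;
}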

CUDA FFT exception

最后都变了- submitted on 2020-01-26 03:15:10
Question: I'm trying to use the CUDA FFT (cuFFT) library. A problem occurs when cufftPlan1d(..) throws an exception. #define NX 256 #define BATCH 10 cufftHandle plan; cufftComplex *data; cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH); if (cudaGetLastError() != cudaSuccess){ fprintf(stderr, "Cuda error: Failed to allocate\n"); return; } if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: Plan creation failed"); return; } When the compiler hits the
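
For reference, a complete, hedged version of the snippet (compiled with nvcc and linked against -lcufft) that checks each call; it does not diagnose the exception itself, since the question is cut off above:

#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

#define NX    256
#define BATCH 10

// Allocate, create a batched 1D C2C plan, run an in-place forward FFT, clean up.
int main()
{
    cufftComplex* data = NULL;
    cudaMalloc((void**)&data, sizeof(cufftComplex) * NX * BATCH);
    if (cudaGetLastError() != cudaSuccess) {
        fprintf(stderr, "Cuda error: Failed to allocate\n");
        return 1;
    }

    cufftHandle plan;
    if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS) {
        fprintf(stderr, "CUFFT error: Plan creation failed\n");
        cudaFree(data);
        return 1;
    }

    // ... fill `data` with input samples (e.g. cudaMemcpy from the host) ...

    if (cufftExecC2C(plan, data, data, CUFFT_FORWARD) != CUFFT_SUCCESS) {
        fprintf(stderr, "CUFFT error: ExecC2C failed\n");
    }
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}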

Is it possible to have a CUDA kernel with varying number of parameters?

那年仲夏 submitted on 2020-01-25 20:22:27
Question: I would like to make a kernel which takes a variable number of arguments. Is this possible? I guess this does not work? But why? Answer 1: If you are asking about typical C-style varargs, then no. But because kernels support C++ linkage, there are template and name-mangling tricks which can be used to instantiate different versions of a kernel with argument lists of different lengths and types. Note also that CUDA 7.0 introduces C++11 variadic template support. So there are options to do this
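
A minimal sketch of the variadic-template option mentioned in the answer (assumes CUDA 7.0+ compiled with -std=c++11; the kernel and helper names are made up for illustration):

#include <cstdio>
#include <cuda_runtime.h>

// Recursive device helper: sums any number of arguments (C++11, no fold expressions needed).
__device__ double sum_args() { return 0.0; }

template <typename T, typename... Rest>
__device__ double sum_args(T first, Rest... rest) { return (double)first + sum_args(rest...); }

// Each distinct argument list instantiates (and name-mangles) a separate kernel.
template <typename... Args>
__global__ void varArgKernel(double* out, Args... args)
{
    *out = sum_args(args...);
}

int main()
{
    double* d_out; double h_out;
    cudaMalloc(&d_out, sizeof(double));

    varArgKernel<<<1, 1>>>(d_out, 1, 2.5, 3.0f);    // three mixed-type arguments
    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("%f\n", h_out);                          // 6.5

    varArgKernel<<<1, 1>>>(d_out, 10, 20);          // different list, different instantiation
    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("%f\n", h_out);                          // 30.0

    cudaFree(d_out);
    return 0;
}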

How do you iterate through a pitched CUDA array?

為{幸葍}努か submitted on 2020-01-25 18:10:52
Question: Having parallelized with OpenMP before, I'm trying to wrap my head around CUDA, which doesn't seem too intuitive to me. At this point, I'm trying to understand exactly how to loop through an array in a parallelized fashion. CUDA by Example is a great start. The snippet on page 43 shows: __global__ void add( int *a, int *b, int *c ) { int tid = blockIdx.x; // handle the data at this index if (tid < N) c[tid] = a[tid] + b[tid]; } Whereas in OpenMP the programmer chooses the number of times the
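
Since the title asks about a pitched array specifically, here is a hedged sketch (hypothetical names) of how the one-thread-per-index pattern from the page-43 snippet extends to a 2D allocation made with cudaMallocPitch, where the pitch is in bytes and every row therefore starts at base + y * pitch:

#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one (x, y) element of a pitched 2D array.
__global__ void scaleKernel(float* devPtr, size_t pitch, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float* row = (float*)((char*)devPtr + y * pitch);  // byte arithmetic for the row start
    row[x] *= s;
}

int main()
{
    const int width = 1000, height = 600;
    float* devPtr; size_t pitch;
    cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
    cudaMemset2D(devPtr, pitch, 0, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scaleKernel<<<grid, block>>>(devPtr, pitch, width, height, 2.0f);
    cudaDeviceSynchronize();

    printf("pitch = %zu bytes for a %d-float row\n", pitch, width);
    cudaFree(devPtr);
    return 0;
}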

CUDA kernel for add(a,b,c) using texture objects for a & b - works correctly for 'increment operation' add(a,b,a)?

吃可爱长大的小学妹 submitted on 2020-01-25 11:31:47
Question: I want to implement a CUDA function 'add(a,b,c)' for adding (component-wise) two one-channel floating-point images 'a' and 'b' together and storing the result in the floating-point image 'c', so that 'c = a + b'. The function will be implemented by first binding texture objects 'aTex' and 'bTex' to the pitch-linear images 'a' and 'b', and then accessing the images 'a' and 'b' inside the kernel only via the texture objects 'aTex' and 'bTex'. The sum is stored in 'c' via a simple write to global
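
A hedged sketch of the setup the question describes (texture objects over pitch-linear images, result written to global memory); the helper name and image sizes are placeholders, and it does not by itself settle whether add(a,b,a) is safe, since the texture reads may be served from the texture cache while the global write goes to the same underlying memory:

#include <cstdio>
#include <cuda_runtime.h>

// Reads 'a' and 'b' through texture objects and writes c = a + b to global memory.
__global__ void add(cudaTextureObject_t aTex, cudaTextureObject_t bTex,
                    float* c, size_t cPitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float va = tex2D<float>(aTex, x, y);
    float vb = tex2D<float>(bTex, x, y);
    float* cRow = (float*)((char*)c + y * cPitch);
    cRow[x] = va + vb;
}

// Creates a texture object over an existing pitch-linear float image.
static cudaTextureObject_t makeTex(float* devPtr, size_t pitch, int width, int height)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = devPtr;
    resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    resDesc.res.pitch2D.width = width;
    resDesc.res.pitch2D.height = height;
    resDesc.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;      // fetch exact texels
    texDesc.readMode = cudaReadModeElementType;    // return raw float values
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}

int main()
{
    const int width = 640, height = 480;
    float *a, *b, *c; size_t pA, pB, pC;
    cudaMallocPitch(&a, &pA, width * sizeof(float), height);
    cudaMallocPitch(&b, &pB, width * sizeof(float), height);
    cudaMallocPitch(&c, &pC, width * sizeof(float), height);
    // ... fill a and b ...

    cudaTextureObject_t aTex = makeTex(a, pA, width, height);
    cudaTextureObject_t bTex = makeTex(b, pB, width, height);

    dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
    add<<<grid, block>>>(aTex, bTex, c, pC, width, height);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(aTex);
    cudaDestroyTextureObject(bTex);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}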
