thrust

CUDA Thrust: reduce_by_key on only some values in an array, based on values in a “key” array

Submitted by 岁酱吖の on 2019-11-28 08:50:22
Let's say I have two device_vector<byte> arrays, d_keys and d_data . If d_data is, for example, a flattened 2D 3x5 array ( e.g. { 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3 } ) and d_keys is a 1D array of size 5 ( e.g. { 1, 0, 0, 1, 1 } ), how can I do a reduction such that I'd end up only adding values on a per-row basis if the corresponding d_keys value is one ( e.g. ending up with a result of { 10, 23, 14 } )? The sum_rows.cu example allows me to add every value in d_data , but that's not quite right. Alternatively, I can, on a per-row basis, use a zip_iterator and combine d_keys with one

How to normalize matrix columns in CUDA with max performance?

Submitted by 狂风中的少年 on 2019-11-28 07:48:16
How do I efficiently normalize matrix columns in CUDA? My matrix is stored in column-major order, and the typical size is 2000x200. The operation can be represented by the following MATLAB code. A = rand(2000,200); A = exp(A); A = A./repmat(sum(A,1), [size(A,1) 1]); Can this be done efficiently with Thrust, cuBLAS and/or cuNPP? A quick implementation using 4 kernels is shown below. I am wondering whether these can be done in 1 or 2 kernels to improve performance, especially for the column summation step implemented with cublasDgemv(). #include <cuda.h> #include <curand.h> #include <cublas_v2.h>

Static Thrust Custom Allocator?

Submitted by 走远了吗. on 2019-11-28 05:59:16
Question: Just a couple of facts for setup: Thrust doesn't operate in-place for all of its operations. You can supply custom allocators to thrust::device_vector. I've looked in thrust::system and thrust::system::cuda and haven't found anything that looks like a static system allocator. By that I mean, I can't see a way of replacing the allocator that Thrust uses internally to allocate extra memory for the out-of-place algorithms. I also find it hard to believe that the functions that are not in-place

passing thrust::device_vector to a function by reference

Submitted by 孤人 on 2019-11-28 05:58:59
Question: I'm trying to pass a device_vector of structures struct point { unsigned int x; unsigned int y; } to a function in the following manner: void print(thrust::device_vector<point> &points, unsigned int index) { std::cout << points[index].x << points[index].y << std::endl; } myvector was initialized properly print(myvector, 0); I get the following errors: error: class "thrust::device_reference<point>" has no member "x" error: class "thrust::device_reference<point>" has no member "y" What's wrong with it?

Efficiency of CUDA vector types (float2, float3, float4)

Submitted by 痞子三分冷 on 2019-11-28 05:06:35
I'm trying to understand the integrate_functor in particles_kernel.cu from CUDA examples: struct integrate_functor { float deltaTime; //constructor for functor //... template <typename Tuple> __device__ void operator()(Tuple t) { volatile float4 posData = thrust::get<2>(t); volatile float4 velData = thrust::get<3>(t); float3 pos = make_float3(posData.x, posData.y, posData.z); float3 vel = make_float3(velData.x, velData.y, velData.z); // update position and velocity // ... // store new position and velocity thrust::get<0>(t) = make_float4(pos, posData.w); thrust::get<1>(t) = make_float4(vel,

Cuda Random Number Generation

Submitted by 拈花ヽ惹草 on 2019-11-28 01:46:31
I was wondering what the best way is to generate one pseudo-random number between 0 and 49k that would be the same for each thread, using curand or something else. I prefer to generate the random numbers inside the kernel because I will have to generate one at a time, but about 10k times. I could also use floats between 0.0 and 1.0, but I've no idea how to make my PRN available to all threads, because most posts and examples show how to get a different PRN for each thread. Thanks Probably you just need to study the curand documentation, especially for the device API. The key to getting

Thrust: sort_by_key slow due to memory allocation

Submitted by 二次信任 on 2019-11-27 23:24:15
I am doing a sort_by_key with key-value int arrays of size 80 million. The device is a GTX 560 Ti with 2GB VRAM. When the available (free) memory before the sort_by_key is 1200MB, it finishes sorting in 200ms. But when the available memory drops to 600MB, the sort_by_key for the same key-value arrays takes 1.5-3s! I ran the program under the Compute Visual Profiler. I found that the GPU timestamp jumps by 1.5-3s between the last kernel before sort_by_key and the first kernel call inside sort_by_key (which is a RakingReduction). I suspect there is a memory allocation being done inside sort

large integer addition with CUDA

Submitted by [亡魂溺海] on 2019-11-27 22:59:56
I've been developing a cryptographic algorithm on the GPU and am currently stuck on an algorithm to perform large integer addition. Large integers are represented in the usual way as a bunch of 32-bit words. For example, we can use one thread to add two 32-bit words. For simplicity, let's assume that the numbers to be added are of the same length and that the number of threads per block == the number of words. Then: __global__ void add_kernel(int *C, const int *A, const int *B) { int x = A[threadIdx.x]; int y = B[threadIdx.x]; int z = x + y; int carry = (z < x); /** do carry propagation in parallel somehow ? */

Replicate a vector multiple times using CUDA Thrust

Submitted by 假装没事ソ on 2019-11-27 22:41:38
Question: I am trying to solve a problem using CUDA Thrust. I have a host array with 3 elements. Is it possible, using Thrust, to create a device array of 384 elements in which the 3 elements of my host array are repeated 128 times ( 128 x 3 = 384 )? Generally speaking, starting from an array of 3 elements, how can I use Thrust to generate a device array of size X, where X = Y x 3, with Y the number of repetitions? Answer 1: One possible approach: create a device vector of appropriate size create 3

From thrust::device_vector to raw pointer and back?

Submitted by 别来无恙 on 2019-11-27 20:32:22
Question: I understand how to go from a vector to a raw pointer, but I'm missing a step on how to go backwards. // our host vector thrust::host_vector<dbl2> hVec; // pretend we put data in it here // get a device_vector thrust::device_vector<dbl2> dVec = hVec; // get the device ptr thrust::device_ptr devPtr = &d_vec[0]; // now how do i get back to device_vector? thrust::device_vector<dbl2> dVec2 = devPtr; // gives error thrust::device_vector<dbl2> dVec2(devPtr); // gives error Can someone explain/point