thrust

Simple CUDA Thrust Program Error

Submitted by 六月ゝ 毕业季﹏ on 2019-11-29 17:53:07
I just wrote a simple CUDA Thrust program, but when I run it I get this error: `thrust::system::system_error at position 0x0037f99c`. Can someone help me figure out why this happens?

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <iostream>

int main() {
    thrust::host_vector<int> h_vec(3);
    h_vec[0] = 1; h_vec[1] = 2; h_vec[2] = 3;

    thrust::device_vector<int> d_vec(3);
    d_vec = h_vec;

    int h_sum = thrust::reduce(h_vec.begin(), h_vec.end());
    int d_sum = thrust::reduce(d_vec.begin(), d_vec.end());
    return 0;
}
```

Robert Crovella: A few…
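One standard first debugging step (a general aid, not taken from the truncated answer above) is to wrap the Thrust calls in a try/catch so the error message is actually printed; `thrust::system_error` derives from `std::runtime_error`, so `what()` is available:

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/system_error.h>
#include <iostream>

int main() {
    try {
        thrust::device_vector<int> d_vec(3, 1);
        int d_sum = thrust::reduce(d_vec.begin(), d_vec.end());
        std::cout << "sum = " << d_sum << std::endl;
    } catch (thrust::system_error &e) {
        // Prints the underlying CUDA error string (e.g. a no-device or
        // wrong-architecture failure) instead of an unhandled-exception crash.
        std::cerr << "Thrust error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
```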

Mix custom memory management and Thrust in CUDA

Submitted by 别来无恙 on 2019-11-29 15:35:29
Question: In my project I have implemented a custom memory allocator to avoid unnecessary calls to cudaMalloc once the application has "warmed up". Moreover, I use custom kernels for basic array filling, arithmetic operations between arrays, etc., and would like to simplify my code by using Thrust and getting rid of these kernels. Every array on the device is created and accessed through raw pointers (for now) and I'd like to use device_vector and Thrust's methods on these objects, but I find myself…
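Although the question is cut off above, the usual way to run Thrust algorithms over memory you allocated yourself is to wrap the raw device pointer with thrust::device_pointer_cast. A minimal sketch (the cudaMalloc call stands in for whatever the custom allocator returns):

```cuda
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1024;
    int *raw = nullptr;
    cudaMalloc(&raw, n * sizeof(int));   // stand-in for the custom allocator

    // Wrap the raw pointer so Thrust treats it as a device iterator.
    thrust::device_ptr<int> dev = thrust::device_pointer_cast(raw);

    thrust::fill(dev, dev + n, 7);            // replaces a hand-written fill kernel
    int sum = thrust::reduce(dev, dev + n);   // replaces a hand-written reduction

    cudaFree(raw);
    return sum == 7 * n ? 0 : 1;
}
```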

How to generate random permutations with CUDA

Submitted by ╄→гoц情女王★ on 2019-11-29 14:56:48
Question: What parallel algorithms could I use to generate random permutations from a given set? Proposals or links to papers suitable for CUDA would be especially helpful. A sequential version of this would be the Fisher-Yates shuffle.

Example: Let S = {1, 2, ..., 7} be the set of source indices. The goal is to generate n random permutations in parallel. Each of the n permutations contains each of the source indices exactly once, e.g. {7, 6, ..., 1}.

Answer 1: The Fisher-Yates shuffle could be parallelized. For…

passing thrust::device_vector to a function by reference

Submitted by 僤鯓⒐⒋嵵緔 on 2019-11-29 12:20:34
I'm trying to pass a device_vector of structures

```cuda
struct point {
    unsigned int x;
    unsigned int y;
};
```

to a function in the following manner:

```cuda
void print(thrust::device_vector<point> &points, unsigned int index) {
    std::cout << points[index].x << points[index].y << std::endl;
}
```

myvector was initialized properly; on calling print(myvector, 0); I get the following errors:

error: class "thrust::device_reference<point>" has no member "x"
error: class "thrust::device_reference<point>" has no member "y"

What's wrong with it? Unfortunately, device_reference<T> cannot expose members of T, but it can convert to T. To implement…
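The fix hinted at in the truncated answer is to convert the device_reference to a plain point first, which copies the element to the host where its members are accessible. A sketch:

```cuda
#include <thrust/device_vector.h>
#include <iostream>

struct point {
    unsigned int x;
    unsigned int y;
};

void print(thrust::device_vector<point> &points, unsigned int index) {
    // device_reference<point> cannot expose .x/.y, but it converts to
    // point: one device-to-host copy per element accessed this way.
    point p = points[index];
    std::cout << p.x << " " << p.y << std::endl;
}

int main() {
    thrust::device_vector<point> myvector(1, point{3, 4});
    print(myvector, 0);
    return 0;
}
```

Note the conversion costs a device-to-host transfer each time, so this pattern suits debugging output rather than hot loops.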

thrust reduction result on device memory

Submitted by 隐身守侯 on 2019-11-29 11:14:38
Is it possible to leave the return value of a thrust::reduce operation in device-allocated memory? If it is, is it just as easy as assigning the value to a cudaMalloc'ed area, or should I use a thrust::device_ptr?

Answer: The short answer is no. thrust::reduce returns a quantity, the result of the reduction, and this quantity must be deposited in a host-resident variable. Take for example reduce, which is synchronous and always returns its result to the CPU: template<typename Iterator, typename T> T…
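If the result is needed on the device afterwards (say, as input to a later kernel), one workaround, sketched below, is to accept the host round-trip and copy the scalar back into device memory:

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>

int main() {
    thrust::device_vector<int> d_vec(100, 2);

    // The reduction result unavoidably lands in a host variable first...
    int h_sum = thrust::reduce(d_vec.begin(), d_vec.end());

    // ...and is then copied back into a device-resident scalar.
    int *d_sum = nullptr;
    cudaMalloc(&d_sum, sizeof(int));
    cudaMemcpy(d_sum, &h_sum, sizeof(int), cudaMemcpyHostToDevice);

    cudaFree(d_sum);
    return 0;
}
```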

Replicate a vector multiple times using CUDA Thrust

Submitted by 天大地大妈咪最大 on 2019-11-29 08:14:00
I am trying to solve a problem using CUDA Thrust. I have a host array with 3 elements. Is it possible, using Thrust, to create a device array of 384 elements in which the 3 elements of my host array are repeated 128 times (128 x 3 = 384)? Generally speaking, starting from an array of 3 elements, how can I use Thrust to generate a device array of size X, where X = Y x 3, i.e. Y is the number of repetitions?

One possible approach:
- create a device vector of the appropriate size
- create 3 strided ranges, one for each of the element positions {1, 2, 3} in the final output (device) vector
- use thrust:…

fast CUDA thrust custom comparison operator

Submitted by 倖福魔咒の on 2019-11-29 04:54:01
I'm evaluating CUDA and am currently using the Thrust library to sort numbers. I'd like to create my own comparator for thrust::sort, but it slows down dramatically! I created my own less implementation by just copying the code from functional.h. However, it seems to be compiled in some other way and works very slowly.

- default comparator: thrust::less() - 94 ms
- my own comparator: less() - 906 ms

I'm using Visual Studio 2010. What should I do to get the same performance as option 1? Complete code:

```cuda
#include <stdio.h>
#include <cuda.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include …
```
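For context, a custom comparator for thrust::sort is a functor whose operator() is marked __host__ __device__, as sketched below. As far as I can tell, a gap like the one measured above is expected: sorting primitive types with thrust::less can dispatch to a fast radix sort, while any user-defined comparator forces the slower comparison-based sort path, regardless of how the functor body is written.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// A hand-written less-than comparator. The __host__ __device__
// qualifiers let Thrust invoke it from device code.
struct my_less {
    __host__ __device__
    bool operator()(const int &lhs, const int &rhs) const {
        return lhs < rhs;
    }
};

int main() {
    thrust::device_vector<int> d(4);
    d[0] = 3; d[1] = 1; d[2] = 4; d[3] = 1;

    // Correct, but takes the comparison-based path rather than the
    // radix-sort specialization that thrust::less<int> permits.
    thrust::sort(d.begin(), d.end(), my_less());
    return 0;
}
```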

Determining the least element and its position in each matrix column with CUDA Thrust

Submitted by 本小妞迷上赌 on 2019-11-29 04:33:34
I have a fairly simple problem but I cannot figure out an elegant solution to it. I have Thrust code which produces c vectors of the same size containing values. Say each of these c vectors has an index. I would like, for each vector position, to get the index of the c vector for which the value is the lowest.

Example:

C0 = (0, 10, 20, 3, 40)
C1 = (1, 2, 3, 5, 10)

I would get as a result a vector containing the index of the C vector which has the lowest value:

result = (0, 1, 1, 0, 1)

I have thought about doing it using thrust zip iterators, but have come across issues: I could zip all the c vectors…

Finding the maximum element value AND its position using CUDA Thrust

Submitted by 巧了我就是萌 on 2019-11-28 22:54:20
Question: How do I get not only the value but also the position of the maximum (minimum) element (res.val and res.pos)?

```cuda
thrust::host_vector<float> h_vec(100);
thrust::generate(h_vec.begin(), h_vec.end(), rand);
thrust::device_vector<float> d_vec = h_vec;
T res = -1;
res = thrust::reduce(d_vec.begin(), d_vec.end(), res, thrust::maximum<T>());
```

Answer 1: Don't use thrust::reduce. Use thrust::max_element (thrust::min_element) in thrust/extrema.h:

thrust::host_vector<float> h_vec(100);
thrust::generate(h…
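Completing the pattern the truncated answer starts: thrust::max_element returns an iterator, so the position falls out of iterator arithmetic and the value out of dereferencing. A sketch:

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/extrema.h>
#include <cstdlib>
#include <iostream>

int main() {
    thrust::host_vector<float> h_vec(100);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);
    thrust::device_vector<float> d_vec = h_vec;

    thrust::device_vector<float>::iterator it =
        thrust::max_element(d_vec.begin(), d_vec.end());

    unsigned int pos = it - d_vec.begin();  // position: iterator arithmetic
    float val = *it;                        // value: (device) dereference

    std::cout << "max " << val << " at " << pos << std::endl;
    return 0;
}
```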

How to asynchronously copy memory from the host to the device using thrust and CUDA streams

Submitted by 怎甘沉沦 on 2019-11-28 20:55:48
I would like to copy memory from the host to the device using Thrust, as in

```cuda
thrust::host_vector<float> h_vec(1 << 28);
thrust::device_vector<float> d_vec(1 << 28);
thrust::copy(h_vec.begin(), h_vec.end(), d_vec.begin());
```

using CUDA streams, analogously to how you would copy memory from device to device using streams:

```cuda
cudaStream_t s;
cudaStreamCreate(&s);

thrust::device_vector<float> d_vec1(1 << 28), d_vec2(1 << 28);
thrust::copy(thrust::cuda::par.on(s), d_vec1.begin(), d_vec1.end(), d_vec2.begin());

cudaStreamSynchronize(s);
cudaStreamDestroy(s);
```

The problem is that I can't set the…
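Though the question is cut off, the standard route to a genuinely asynchronous host-to-device copy, sketched below, is pinned (page-locked) host memory plus cudaMemcpyAsync; a copy from pageable memory, which a plain thrust::host_vector allocates, cannot overlap with other work.

```cuda
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;

    // Pinned host buffer: required for the copy to be truly asynchronous.
    float *h_pinned = nullptr;
    cudaMallocHost(&h_pinned, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) h_pinned[i] = 1.0f;

    thrust::device_vector<float> d_vec(n);

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(thrust::raw_pointer_cast(d_vec.data()), h_pinned,
                    n * sizeof(float), cudaMemcpyHostToDevice, s);
    // ... other host work or kernels on other streams can run here ...
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFreeHost(h_pinned);
    return 0;
}
```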