thrust

How to implement nested loops in CUDA Thrust

两盒软妹~` · Submitted on 2019-12-05 07:54:12
Question: I currently have to run a nested loop as follows:

```cpp
for (int i = 0; i < N; i++) {
    for (int j = i + 1; j <= N; j++) {
        compute(...); // some calculation here
    }
}
```

I've tried leaving the first loop on the CPU and doing the second loop on the GPU; the result is too many memory accesses. Is there any other way to do it, for example with thrust::reduce_by_key? The whole program is here:

```cpp
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust
```
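One common Thrust pattern for this shape of problem is to flatten the triangular (i, j) loop into a single linear index and run one pass over all N*(N-1)/2 pairs with a counting iterator. A host-side sketch of the index inversion (this simple O(N) form is easy to verify; a real kernel would typically use a closed-form sqrt-based version, and `pair_count`/`unflatten` are illustrative names):

```cpp
#include <cassert>
#include <utility>

// Total number of (i, j) pairs with 0 <= i < j < N.
long pair_count(long N) { return N * (N - 1) / 2; }

// Map a linear index k in [0, pair_count(N)) back to (i, j) with i < j.
// Row i owns (N - 1 - i) pairs, so we walk rows subtracting their sizes.
std::pair<long, long> unflatten(long k, long N)
{
    long i = 0;
    long row = N - 1;                        // pairs owned by row i
    while (k >= row) { k -= row; --row; ++i; }
    return {i, i + 1 + k};
}
```

With this mapping, the doubly nested loop becomes a single thrust::for_each (or thrust::transform_reduce, if the per-pair results are accumulated) over a counting_iterator of length pair_count(N), where the functor recovers (i, j) from the linear index and calls compute.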

How can I find row to all rows distance matrix between two matrices W and X in Thrust or Cublas?

泄露秘密 · Submitted on 2019-12-04 22:06:33
I have the following MATLAB code:

```matlab
tempx = full(sum(X.^2, 2));
tempc = full(sum(C.^2, 2).');
D = -2*(X * C.');
D = bsxfun(@plus, D, tempx);
D = bsxfun(@plus, D, tempc);
```

where X is n×m and C (the W of the title) is k×m, respectively. One is the data and the other is the weight matrix. I compute the distance matrix D with the given code. I am looking for an efficient cuBLAS or Thrust implementation of these operations. I managed the line `D = -2*(X * C.');` with cuBLAS, but as a newbie the remaining part is still a question for me. Can anybody help with a snippet or give suggestions? Here is what I have so far: Edit: I add some more
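The MATLAB code relies on the identity ||x_i − c_j||² = ||x_i||² + ||c_j||² − 2·x_i·c_j: one GEMM plus two broadcast additions of squared row norms. A host-side sketch of the same three steps (on the GPU, the first loop nest is the cublasSgemm call and the norm additions map to a thrust::transform over a counting iterator; `dist2` is an illustrative name):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Squared-distance matrix D (n x k) between rows of X (n x m) and C (k x m),
// built the same way as the MATLAB code: D = -2*X*C' + rowsq(X) + rowsq(C)'.
std::vector<double> dist2(const std::vector<double>& X,
                          const std::vector<double>& C,
                          int n, int k, int m)
{
    std::vector<double> D(n * k, 0.0);
    // Step 1: D = -2 * X * C'  (the part already done with cuBLAS).
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < k; ++j)
            for (int t = 0; t < m; ++t)
                D[i * k + j] -= 2.0 * X[i * m + t] * C[j * m + t];
    // Steps 2-3: broadcast-add the squared row norms (the bsxfun lines).
    for (int i = 0; i < n; ++i) {
        double xs = 0.0;
        for (int t = 0; t < m; ++t) xs += X[i * m + t] * X[i * m + t];
        for (int j = 0; j < k; ++j) {
            double cs = 0.0;
            for (int t = 0; t < m; ++t) cs += C[j * m + t] * C[j * m + t];
            D[i * k + j] += xs + cs;
        }
    }
    return D;
}
```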

thrust: fill isolate space

大憨熊 · Submitted on 2019-12-04 20:51:02
I have an array like this:

```
0 0 0 1 0 0 0 0 5 0 0 3 0 0 0 8 0 0
```

I want every non-zero element to expand one position at a time until it reaches another non-zero element, so the result looks like this:

```
1 1 1 1 1 1 5 5 5 5 3 3 3 3 8 8 8 8
```

Is there any way to do this using thrust?

Yes, here is one possible approach. For each position in the sequence, compute two distances: the first is the distance to the nearest non-zero value in the left direction, and the second is the distance to the nearest non-zero value in the right direction. If the position
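The two-distance approach can be sketched on the host as follows (in Thrust, the two distance arrays map to a forward and a reverse inclusive scan with a "reset on non-zero" operator; here they are computed with plain loops for clarity, ties go left to match the example, and at least one non-zero element is assumed):

```cpp
#include <cassert>
#include <vector>

// Fill each zero with the value of the nearest non-zero neighbour.
std::vector<int> fill_isolated(const std::vector<int>& a)
{
    const int n = static_cast<int>(a.size());
    const int INF = n + 1;                       // "no non-zero seen yet"
    std::vector<int> left(n, INF), right(n, INF);
    // left[i]: distance to the nearest non-zero at or before position i.
    for (int i = 0; i < n; ++i)
        left[i] = a[i] != 0 ? 0 : (i > 0 ? left[i - 1] + 1 : INF);
    // right[i]: distance to the nearest non-zero at or after position i.
    for (int i = n - 1; i >= 0; --i)
        right[i] = a[i] != 0 ? 0 : (i < n - 1 ? right[i + 1] + 1 : INF);
    // Pick whichever neighbour is closer; ties go to the left.
    std::vector<int> out(n);
    for (int i = 0; i < n; ++i)
        out[i] = (left[i] <= right[i]) ? a[i - left[i]] : a[i + right[i]];
    return out;
}
```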

Thrust: How to directly control where an algorithm invocation executes?

混江龙づ霸主 · Submitted on 2019-12-04 18:58:13
The following code has no information that determines whether it runs on the CPU or the GPU. I wonder where the "reduce" operation is executed:

```cpp
#include <thrust/iterator/counting_iterator.h>
...
// create iterators
thrust::counting_iterator<int> first(10);
thrust::counting_iterator<int> last = first + 3;

first[0]   // returns 10
first[1]   // returns 11
first[100] // returns 110

// sum of [first, last)
thrust::reduce(first, last); // returns 33 (i.e. 10 + 11 + 12)
```

Furthermore:

```cpp
thrust::transform_reduce(
    thrust::counting_iterator<unsigned int>(0),
    thrust::counting_iterator<unsigned int>(N),
    MyOperation(data), 0
```
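Dispatch in Thrust is decided at compile time from the iterators' system tags, and a counting_iterator belongs to the "any system" tag, so the call can be steered to either side explicitly with an execution policy (thrust::reduce(thrust::host, first, last) or thrust::reduce(thrust::device, first, last)). The arithmetic itself is just a sum of consecutive integers, which can be checked on the host (a sketch; `counting_reduce` is an illustrative name):

```cpp
#include <cassert>

// Host-side equivalent of thrust::reduce(first, last) where
// first = counting_iterator<long>(start) and last = first + n:
// the sum of the n consecutive integers [start, start + n),
// written in closed form.
long counting_reduce(long start, long n)
{
    return n * start + n * (n - 1) / 2;
}
```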

how fast is thrust::sort and what is the fastest radix sort implementation

爷,独闯天下 · Submitted on 2019-12-04 17:16:58
I'm a newbie to GPU programming. Recently I've been trying to implement the GPU BVH construction algorithm based on a tutorial: http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/ . In the first step of this algorithm, the Morton code (unsigned int) of every primitive is computed and sorted. The tutorial gives reference timings for computing and sorting the Morton codes of 12K objects:

0.02 ms, one thread per object: calculate bounding box and assign Morton code.
0.18 ms, parallel radix sort: sort the objects according to their Morton codes.

... In my
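The tutorial's sorting step is a key-value sort: Morton codes as keys, object indices as values. With Thrust this is a single thrust::sort_by_key call, which dispatches to a radix sort for unsigned integer keys. A host sketch of the same key-value sort (std::stable_sort on a permutation; `sort_by_morton` is an illustrative name):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sort object ids by their Morton codes, the way
// thrust::sort_by_key(codes.begin(), codes.end(), ids.begin()) would.
void sort_by_morton(std::vector<unsigned int>& codes, std::vector<int>& ids)
{
    std::vector<int> order(codes.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return codes[a] < codes[b]; });
    std::vector<unsigned int> c2(codes.size());
    std::vector<int> i2(ids.size());
    for (size_t k = 0; k < order.size(); ++k) {
        c2[k] = codes[order[k]];             // gather keys into sorted order
        i2[k] = ids[order[k]];               // carry the values along
    }
    codes.swap(c2);
    ids.swap(i2);
}
```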

Timing Kernel launches in CUDA while using Thrust

喜欢而已 · Submitted on 2019-12-04 13:06:33
Kernel launches in CUDA are generally asynchronous, which (as I understand it) means that once a CUDA kernel is launched, control returns immediately to the CPU. The CPU continues doing useful work while the GPU is busy number crunching, unless the CPU is forcibly stalled by cudaThreadSynchronize() or cudaMemcpy(). Now I have just started using the Thrust library for CUDA. Are the function calls in Thrust synchronous or asynchronous? In other words, if I invoke thrust::sort(D.begin(), D.end()); where D is a device vector, does it make sense to measure the sorting time using start =
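Whatever the answer for a specific Thrust call, the timing rule is the same as for raw kernels: a wall-clock timer must not be stopped until the device work has been synchronized (cudaDeviceSynchronize(), or a cudaEvent pair). The pitfall is generic to any asynchronous API, and can be illustrated host-only with std::async standing in for a kernel launch (no CUDA required; `timed_sum` is an illustrative name):

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Launching work asynchronously returns immediately; only after the
// explicit synchronisation point (get(), analogous to
// cudaDeviceSynchronize()) is the result guaranteed to exist.
long timed_sum(const std::vector<long>& v)
{
    auto fut = std::async(std::launch::async, [&] {
        return std::accumulate(v.begin(), v.end(), 0L);
    });
    // ... stopping a wall-clock timer here would measure only the launch ...
    return fut.get();  // synchronise: the correct place to stop the timer
}
```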

is there a better and a faster way to copy from CPU memory to GPU using thrust?

谁都会走 · Submitted on 2019-12-04 08:37:20
Question: Recently I have been using Thrust a lot. I have noticed that in order to use Thrust, one must always copy the data from CPU memory to GPU memory. Let's look at the following example:

```cpp
int foo(int *foo)
{
    host_vector<int> m(foo, foo + 100000);
    device_vector<int> s = m;
}
```

I'm not quite sure how the host_vector constructor works, but it seems like I'm copying the initial data coming from *foo twice: once into the host_vector when it is initialized, and another time when the device_vector is
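The suspicion is right: the range constructor copies once, and the assignment copies again, whereas constructing the device_vector directly from the pointer range (device_vector<int> s(foo, foo + 100000)) performs a single host-to-device transfer. The double-copy effect can be demonstrated on the host with an instrumented element type (plain std::vectors stand in for host_vector/device_vector; `Counted`, `two_step`, `one_step` are illustrative names):

```cpp
#include <cassert>
#include <vector>

// Element type that counts how many times it is copied.
struct Counted {
    int v = 0;
    static int copies;
    Counted() = default;
    Counted(const Counted& o) : v(o.v) { ++copies; }
    Counted& operator=(const Counted& o) { v = o.v; ++copies; return *this; }
};
int Counted::copies = 0;

// Two-step route: staging "host_vector", then "device_vector" -> 2N copies.
int two_step(const Counted* p, int n)
{
    Counted::copies = 0;
    std::vector<Counted> host(p, p + n);   // copy 1: raw memory -> host_vector
    std::vector<Counted> device = host;    // copy 2: host_vector -> device_vector
    return Counted::copies;
}

// Direct route: construct the device-side vector from the range -> N copies.
int one_step(const Counted* p, int n)
{
    Counted::copies = 0;
    std::vector<Counted> device(p, p + n); // single copy
    return Counted::copies;
}
```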

Thrust copy - OutputIterator column-major order

我是研究僧i · Submitted on 2019-12-04 06:20:08
Question: I have a vector of matrices (stored as column-major arrays) that I want to concatenate vertically. Therefore, I want to use the copy function from the Thrust framework, as in the following snippet:

```cpp
int offset = 0;
for (int i = 0; i < matrices.size(); ++i)
{
    thrust::copy(
        thrust::device_ptr<float>(matrices[i]),
        thrust::device_ptr<float>(matrices[i]) + rows[i] * cols[i],
        thrust::device_ptr<float>(result) + offset
    );
    offset += rows[i] * cols[i];
}
```

EDIT: extended example: The problem is,

Thrust equivalent of OpenMP code

醉酒当歌 · Submitted on 2019-12-04 05:08:10
Question: The code I'm trying to parallelize in OpenMP is a Monte Carlo simulation that boils down to something like this:

```cpp
int seed = 0;
std::mt19937 rng(seed);
double result = 0.0;
int N = 1000;
#pragma omp parallel for
for (int i = 0; i < N; i++)
{
    result += rng();
}
std::cout << result << std::endl;
```

I want to make sure that the state of the random number generator is shared across threads, and that the addition to the result is atomic. Is there a way to replace this code with something from thrust::omp? From the
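As written, the shared mt19937 and the unprotected += are both data races under OpenMP. The usual fix, and the shape that maps directly onto thrust::transform_reduce over a counting_iterator with Thrust's OpenMP backend, is one independently seeded generator per work item plus a reduction. A host sketch (seeding with seed + i is an illustrative choice, not a statistically rigorous one; `monte_carlo` is an illustrative name):

```cpp
#include <cassert>
#include <random>

// Race-free Monte Carlo accumulation: each index gets its own generator,
// so every loop iteration is independent and the sum can be reduced in
// any order -- exactly what a parallel reduction needs.
double monte_carlo(int N, unsigned seed)
{
    double result = 0.0;
    for (int i = 0; i < N; ++i) {      // parallelisable: no shared state
        std::mt19937 rng(seed + i);    // per-element generator (illustrative seeding)
        result += static_cast<double>(rng());
    }
    return result;
}
```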

Sorting objects with Thrust CUDA

依然范特西╮ · Submitted on 2019-12-04 04:45:36
Is it possible to sort objects using the Thrust library? I have the following struct:

```cpp
struct OB
{
    int N;
    Cls *C; // Cls is another struct
};
```

Is it possible to use Thrust to sort an array of OB according to N? Can you provide a simple example of using Thrust to sort objects? If Thrust is not able to do so, is there another CUDA library that allows me to do this?

The docs for thrust::sort show that it accepts a comparison operator. See in their example how those are defined and used. I haven't tested this, but based on the example, all you would need is a struct that looks something like
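thrust::sort takes the same kind of strict-weak-ordering comparator as std::sort; for a device-side sort the functor's operator() would additionally be marked __host__ __device__. The comparator shape can be checked with std::sort on the host (a sketch; `Cls` is a stand-in for the question's other struct and `CompareByN` is an illustrative name):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Cls { int dummy; };   // stand-in for the question's other struct

struct OB {
    int N;
    Cls *C;
};

// Comparator ordering OB by N; for thrust::sort on the device, the
// operator() would be annotated __host__ __device__.
struct CompareByN {
    bool operator()(const OB& a, const OB& b) const { return a.N < b.N; }
};

// std::sort here for verification; the Thrust call would be
// thrust::sort(d_vec.begin(), d_vec.end(), CompareByN()).
void sort_obs(std::vector<OB>& v)
{
    std::sort(v.begin(), v.end(), CompareByN());
}
```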