thrust

fast CUDA thrust custom comparison operator

眉间皱痕 submitted on 2019-11-27 16:53:47
Question: I'm evaluating CUDA and currently using the Thrust library to sort numbers. I'd like to create my own comparer for thrust::sort, but it slows down dramatically! I created my own less implementation by just copying the code from functional.h. However, it seems to be compiled in some other way and runs very slowly. Default comparer, thrust::less(): 94 ms. My own comparer, less(): 906 ms. I'm using Visual Studio 2010. What should I do to get the same performance as option 1? Complete code: #include
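A minimal sketch of a working custom comparator (assumed names, not the asker's complete code): the functor's operator() is marked __host__ __device__ so it can run on the device. Note that even a correct functor may remain slower than thrust::less on primitive types, because Thrust can dispatch the built-in comparator to a radix sort but falls back to a comparison sort for user-defined functors.

```cpp
// Sketch: custom less-than functor usable with thrust::sort on the device.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>

struct my_less
{
    __host__ __device__
    bool operator()(const int &a, const int &b) const
    {
        return a < b;   // same semantics as thrust::less<int>
    }
};

int main()
{
    thrust::device_vector<int> d(1 << 20);
    thrust::sequence(d.rbegin(), d.rend());       // descending input
    thrust::sort(d.begin(), d.end(), my_less());  // sort with the custom comparator
    return 0;
}
```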

Combining two lists by key using Thrust

那年仲夏 submitted on 2019-11-27 15:49:16
Given two key-value lists, I am trying to combine the two sides by matching the keys and applying a function to the two values when the keys match. In my case I want to multiply the values. A small example to make it clearer: Left keys: { 1, 2, 4, 5, 6 } Left values: { 3, 4, 1, 2, 1 } Right keys: { 1, 3, 4, 5, 6, 7 } Right values: { 2, 1, 1, 4, 1, 2 } Expected output keys: { 1, 4, 5, 6 } Expected output values: { 6, 1, 8, 1 } I have been able to implement this on the CPU in C++ with the following code: int main() { int leftKeys[5] = { 1, 2, 4, 5, 6 }; int leftValues[5] = { 3, 4, 1, 2, 1 };
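A hedged Thrust-side sketch (not the asker's CPU code, and assuming both key lists are sorted ascending as in the example): thrust::set_intersection_by_key is called twice, once to collect the left values of the matching keys and once for the right values, and the two result arrays are then multiplied pairwise.

```cpp
#include <thrust/device_vector.h>
#include <thrust/set_operations.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/pair.h>

int main()
{
    int lk[5] = { 1, 2, 4, 5, 6 },    lv[5] = { 3, 4, 1, 2, 1 };
    int rk[6] = { 1, 3, 4, 5, 6, 7 }, rv[6] = { 2, 1, 1, 4, 1, 2 };

    thrust::device_vector<int> leftKeys(lk, lk + 5),  leftVals(lv, lv + 5);
    thrust::device_vector<int> rightKeys(rk, rk + 6), rightVals(rv, rv + 6);

    thrust::device_vector<int> outKeys(5), outLeft(5), outRight(5);
    typedef thrust::device_vector<int>::iterator iter;

    // keys present on both sides, paired with the left values
    thrust::pair<iter, iter> left_end = thrust::set_intersection_by_key(
        leftKeys.begin(), leftKeys.end(),
        rightKeys.begin(), rightKeys.end(),
        leftVals.begin(),
        outKeys.begin(), outLeft.begin());
    int n = left_end.first - outKeys.begin();   // 4 matching keys

    // the same keys again, this time paired with the right values
    thrust::set_intersection_by_key(
        rightKeys.begin(), rightKeys.end(),
        leftKeys.begin(), leftKeys.end(),
        rightVals.begin(),
        outKeys.begin(), outRight.begin());

    // multiply the matched values: { 6, 1, 8, 1 }
    thrust::device_vector<int> outVals(n);
    thrust::transform(outLeft.begin(), outLeft.begin() + n,
                      outRight.begin(), outVals.begin(),
                      thrust::multiplies<int>());
    return 0;
}
```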

Sorting packed vertices with thrust

一世执手 submitted on 2019-11-27 08:52:58
Question: So I have a device array of PackedVertex structs: struct PackedVertex { glm::vec3 Vertex; glm::vec2 UV; glm::vec3 Normal; } I'm trying to sort them so that duplicates are clustered together in the array; I don't care about the overall order at all. I've tried sorting them by comparing the lengths of the vectors, which ran but didn't sort them correctly, so now I'm trying per variable, using 3 stable_sorts with the binary operators: __thrust_hd_warning_disable__ struct sort_packed_verts_by_vertex :
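A hedged sketch of one way to cluster exact duplicates (glm::vec3/glm::vec2 replaced with CUDA's float3/float2 to keep the snippet self-contained): a single lexicographic comparator over all components lets one thrust::sort do the job of the three stable_sorts.

```cpp
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

struct PackedVertex
{
    float3 Vertex;   // stand-in for glm::vec3
    float2 UV;       // stand-in for glm::vec2
    float3 Normal;
};

// strict lexicographic ordering over all eight components:
// equal vertices compare equal, so duplicates end up adjacent after sorting
struct lex_less
{
    __host__ __device__
    bool operator()(const PackedVertex &a, const PackedVertex &b) const
    {
        if (a.Vertex.x != b.Vertex.x) return a.Vertex.x < b.Vertex.x;
        if (a.Vertex.y != b.Vertex.y) return a.Vertex.y < b.Vertex.y;
        if (a.Vertex.z != b.Vertex.z) return a.Vertex.z < b.Vertex.z;
        if (a.UV.x     != b.UV.x)     return a.UV.x     < b.UV.x;
        if (a.UV.y     != b.UV.y)     return a.UV.y     < b.UV.y;
        if (a.Normal.x != b.Normal.x) return a.Normal.x < b.Normal.x;
        if (a.Normal.y != b.Normal.y) return a.Normal.y < b.Normal.y;
        return a.Normal.z < b.Normal.z;
    }
};

int main()
{
    thrust::device_vector<PackedVertex> d_verts(1024);  // zero-initialised placeholder data
    thrust::sort(d_verts.begin(), d_verts.end(), lex_less());
    return 0;
}
```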

how to get max blocks in thrust in cuda 5.5

半城伤御伤魂 submitted on 2019-11-27 08:40:15
Question: The Thrust function below can get the maximum number of active blocks for a CUDA launch in CUDA 5.0. It is used by sparse matrix-vector multiplication (SpMV) in CUSP, and it is a technique for setting up execution for persistent threads. The first line is the header file. #include <thrust/detail/backend/cuda/arch.h> thrust::detail::backend::cuda::arch::max_active_blocks(kernel<float,int,VECTORS_PER_BLOCK,THREADS_PER_VECTOR>,THREADS_PER_BLOCK,(size_t)0) But the function is not supported by CUDA 5.5. Was
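A hedged sketch of a replacement: the internal Thrust header no longer exists, but the CUDA runtime's occupancy API (available from CUDA 6.5 onward, not in 5.5 itself) answers the same question. The kernel below is only a placeholder for the asker's templated SpMV kernel.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// placeholder kernel standing in for kernel<float,int,VECTORS_PER_BLOCK,THREADS_PER_VECTOR>
__global__ void kernel(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i];
}

int main()
{
    const int THREADS_PER_BLOCK = 128;

    // how many blocks of this kernel fit on one SM at once
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, kernel, THREADS_PER_BLOCK, /*dynamicSMem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // maximum number of resident blocks for a persistent-threads launch
    printf("max active blocks: %d\n", blocksPerSM * prop.multiProcessorCount);
    return 0;
}
```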

cuda thrust::remove_if throws “thrust::system::system_error” for device_vector?

浪尽此生 submitted on 2019-11-27 08:35:47
Question: I am currently using CUDA 7.5 under VS 2013. Today I needed to remove some of the elements from a device_vector, so I decided to use remove_if. But however I modify the code, the program compiles fine but throws "thrust::system::system_error" at run time. First I tried my own code: int main() { thrust::host_vector<int> AA(10, 1); thrust::sequence(AA.begin(), AA.end()); thrust::host_vector<bool> SS(10,false); thrust::fill(SS.begin(), SS.begin() + 5, true); thrust::device_vector<int>
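A minimal sketch of the usual cause and fix (the asker's code is cut off above): remove_if with a stencil requires the data and the stencil to live in the same memory space, so here the bool stencil is copied into a device_vector before use.

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>
#include <thrust/remove.h>
#include <thrust/functional.h>

int main()
{
    thrust::host_vector<int> AA(10, 1);
    thrust::sequence(AA.begin(), AA.end());

    thrust::host_vector<bool> SS(10, false);
    thrust::fill(SS.begin(), SS.begin() + 5, true);

    thrust::device_vector<int>  dAA = AA;   // data on the device
    thrust::device_vector<bool> dSS = SS;   // stencil must be on the device too

    // remove elements whose stencil value is true
    thrust::device_vector<int>::iterator new_end =
        thrust::remove_if(dAA.begin(), dAA.end(), dSS.begin(), thrust::identity<bool>());
    dAA.erase(new_end, dAA.end());
    return 0;
}
```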

cuda-gdb crashes with thrust (CUDA release 5.5)

不打扰是莪最后的温柔 submitted on 2019-11-27 07:32:09
Question: I have the following trivial thrust::gather program (taken directly from the thrust::gather documentation): #include <thrust/gather.h> #include <thrust/device_vector.h> int main(void) { // mark even indices with a 1; odd indices with a 0 int values[10] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0}; thrust::device_vector<int> d_values(values, values + 10); // gather all even indices into the first half of the range // and odd indices to the last half of the range int map[10] = {0, 2, 4, 6, 8, 1, 3, 5, 7, 9}
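The excerpt above is cut off; for reference, a sketch of how the documentation example typically continues (the map is copied to the device and used to permute d_values into d_output):

```cpp
#include <thrust/gather.h>
#include <thrust/device_vector.h>

int main(void)
{
    // mark even indices with a 1; odd indices with a 0
    int values[10] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0};
    thrust::device_vector<int> d_values(values, values + 10);

    // gather all even indices into the first half of the range
    // and odd indices to the last half of the range
    int map[10] = {0, 2, 4, 6, 8, 1, 3, 5, 7, 9};
    thrust::device_vector<int> d_map(map, map + 10);

    thrust::device_vector<int> d_output(10);
    thrust::gather(d_map.begin(), d_map.end(),
                   d_values.begin(), d_output.begin());
    return 0;
}
```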

Counting occurrences of numbers in a CUDA array

六月ゝ 毕业季﹏ submitted on 2019-11-27 06:10:05
Question: I have an array of unsigned integers stored on the GPU with CUDA (typically 1000000 elements). I would like to count the occurrences of every number in the array. There are only a few distinct numbers (about 10), but these numbers can span from 1 to 1000000. About 9/10ths of the numbers are 0, and I don't need their count. The result looks something like this: 58458 -> 1000 occurrences, 15 -> 412 occurrences. I have an implementation using atomicAdds, but it is too slow (a lot of threads
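A hedged sketch of a standard Thrust alternative to the atomicAdd approach: sort a copy of the data, then reduce_by_key over a stream of ones to get (value, count) pairs; the count for 0 can simply be ignored afterwards.

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/pair.h>
#include <thrust/iterator/constant_iterator.h>

int main()
{
    // tiny stand-in for the real data; most of the entries are 0
    unsigned int h[12] = { 0, 0, 15, 0, 58458, 0, 15, 0, 58458, 0, 0, 15 };
    thrust::device_vector<unsigned int> d(h, h + 12);

    thrust::sort(d.begin(), d.end());   // group equal numbers together

    thrust::device_vector<unsigned int> uniq(12), counts(12);
    typedef thrust::device_vector<unsigned int>::iterator uint_iter;
    thrust::pair<uint_iter, uint_iter> ends = thrust::reduce_by_key(
        d.begin(), d.end(),
        thrust::constant_iterator<unsigned int>(1),   // one "vote" per element
        uniq.begin(), counts.begin());

    int num_distinct = ends.first - uniq.begin();     // includes the 0 bucket, which can be skipped
    (void)num_distinct;
    return 0;
}
```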

CUDA Thrust: reduce_by_key on only some values in an array, based off values in a “key” array

三世轮回 submitted on 2019-11-27 05:48:04
Question: Let's say I have two device_vector<byte> arrays, d_keys and d_data. If d_data is, for example, a flattened 2D 3x5 array (e.g. { 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3 }) and d_keys is a 1D array of size 5 (e.g. { 1, 0, 0, 1, 1 }), how can I do a reduction such that I end up adding values on a per-row basis only if the corresponding d_keys value is one (e.g. ending up with a result of { 10, 23, 14 })? The sum_rows.cu example allows me to add every value in d_data, but that's not
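A hedged sketch of one possible approach (the helper functors column_index and row_index are made up here, and int is used instead of byte): mask each element by the key of its column, then reduce_by_key with the row index as the key, reproducing the { 10, 23, 14 } result from the example above.

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// map a flat index to its column (i % cols) or its row (i / cols)
struct column_index
{
    int cols;
    __host__ __device__ column_index(int c) : cols(c) {}
    __host__ __device__ int operator()(int i) const { return i % cols; }
};

struct row_index
{
    int cols;
    __host__ __device__ row_index(int c) : cols(c) {}
    __host__ __device__ int operator()(int i) const { return i / cols; }
};

int main()
{
    const int NUM_ROWS = 3, NUM_COLS = 5, N = NUM_ROWS * NUM_COLS;
    int data[N] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3 };
    int keys[NUM_COLS] = { 1, 0, 0, 1, 1 };

    thrust::device_vector<int> d_data(data, data + N);
    thrust::device_vector<int> d_keys(keys, keys + NUM_COLS);

    // d_masked[i] = d_data[i] * d_keys[i % NUM_COLS]
    thrust::device_vector<int> d_masked(N);
    thrust::transform(
        d_data.begin(), d_data.end(),
        thrust::make_permutation_iterator(
            d_keys.begin(),
            thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                            column_index(NUM_COLS))),
        d_masked.begin(),
        thrust::multiplies<int>());

    // sum the masked values of each row: { 10, 23, 14 }
    thrust::device_vector<int> d_sums(NUM_ROWS);
    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                        row_index(NUM_COLS)),
        thrust::make_transform_iterator(thrust::make_counting_iterator(N),
                                        row_index(NUM_COLS)),
        d_masked.begin(),
        thrust::make_discard_iterator(),
        d_sums.begin());
    return 0;
}
```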

Efficiency of CUDA vector types (float2, float3, float4)

我只是一个虾纸丫 submitted on 2019-11-27 05:29:53
Question: I'm trying to understand the integrate_functor in particles_kernel.cu from the CUDA examples: struct integrate_functor { float deltaTime; //constructor for functor //... template <typename Tuple> __device__ void operator()(Tuple t) { volatile float4 posData = thrust::get<2>(t); volatile float4 velData = thrust::get<3>(t); float3 pos = make_float3(posData.x, posData.y, posData.z); float3 vel = make_float3(velData.x, velData.y, velData.z); // update position and velocity // ... // store new
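A hedged illustration of the point the functor relies on (not taken from the SDK sample): float4 is 16-byte aligned, so a float4 load or store can be issued as a single 128-bit transaction, while the arithmetic is done on a float3 so the unused w component never enters the computation.

```cpp
#include <cuda_runtime.h>

__global__ void integrate(float4 *pos, const float4 *vel, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 p4 = pos[i];                         // one 128-bit load
    float4 v4 = vel[i];

    float3 p = make_float3(p4.x, p4.y, p4.z);   // compute with 3 components only
    float3 v = make_float3(v4.x, v4.y, v4.z);

    p.x += v.x * dt;
    p.y += v.y * dt;
    p.z += v.z * dt;

    pos[i] = make_float4(p.x, p.y, p.z, p4.w);  // one 128-bit store
}

int main()
{
    const int n = 1024;
    float4 *pos, *vel;
    cudaMalloc(&pos, n * sizeof(float4));
    cudaMalloc(&vel, n * sizeof(float4));
    integrate<<<(n + 255) / 256, 256>>>(pos, vel, 0.01f, n);
    cudaDeviceSynchronize();
    cudaFree(pos);
    cudaFree(vel);
    return 0;
}
```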

Thrust: sort_by_key slow due to memory allocation

白昼怎懂夜的黑 submitted on 2019-11-27 04:41:11
Question: I am doing a sort_by_key with key-value int arrays of size 80 million. The device is a GTX 560 Ti with 2GB of VRAM. When the available (free) memory before the sort_by_key is 1200MB, it finishes sorting in 200ms. But when the available memory drops to 600MB, the sort_by_key for the same key-value arrays takes 1.5-3s! I ran the program under Compute Visual Profiler. I found that the GPU timestamp jumps by 1.5-3s between the last kernel before sort_by_key and the first kernel call inside
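A hedged sketch of one known workaround (a cut-down version of Thrust's custom_temporary_allocation example; it needs a Thrust version with execution-policy support): a caching allocator keeps freed scratch blocks around, so sort_by_key stops paying for a fresh temporary allocation on every call.

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <map>
#include <cstddef>

struct cached_allocator
{
    typedef char value_type;

    std::multimap<std::ptrdiff_t, char *> free_blocks;       // size -> block
    std::map<char *, std::ptrdiff_t>      allocated_blocks;  // block -> size

    char *allocate(std::ptrdiff_t num_bytes)
    {
        char *result = 0;
        std::multimap<std::ptrdiff_t, char *>::iterator hit = free_blocks.find(num_bytes);
        if (hit != free_blocks.end())
        {
            result = hit->second;            // reuse a cached block
            free_blocks.erase(hit);
        }
        else
        {
            cudaMalloc(&result, num_bytes);  // cache miss: allocate fresh
        }
        allocated_blocks[result] = num_bytes;
        return result;
    }

    void deallocate(char *ptr, std::size_t)
    {
        std::ptrdiff_t num_bytes = allocated_blocks[ptr];
        allocated_blocks.erase(ptr);
        free_blocks.insert(std::make_pair(num_bytes, ptr));  // keep for reuse
    }
    // a full version would also cudaFree the cached blocks in a destructor
};

int main()
{
    const int N = 1 << 20;   // scaled-down stand-in for the 80M-element arrays
    thrust::device_vector<int> keys(N), vals(N);

    cached_allocator alloc;
    // the first call allocates scratch space, later calls reuse it
    thrust::sort_by_key(thrust::cuda::par(alloc), keys.begin(), keys.end(), vals.begin());
    thrust::sort_by_key(thrust::cuda::par(alloc), keys.begin(), keys.end(), vals.begin());
    return 0;
}
```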