cub

Making CUB BlockRadixSort on-chip entirely?

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-18 07:20:43
Question: I am reading the CUB documentation and examples:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[4];
    ...
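The documentation example breaks off above. For context, here is a minimal, self-contained completion; it is a sketch under stated assumptions (the d_in/d_out parameters and the blocked load/store loops are illustrative additions, not part of the CUB example):

#include <cub/cub.cuh>

__global__ void ExampleKernel(int *d_in, int *d_out)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Each thread loads 4 consecutive keys (a "blocked" arrangement)
    int thread_keys[4];
    int offset = (blockIdx.x * 128 + threadIdx.x) * 4;
    for (int i = 0; i < 4; ++i)
        thread_keys[i] = d_in[offset + i];

    // Collectively sort the block's 512 keys; the sort itself touches only
    // registers and the shared-memory temp_storage, i.e. it runs on-chip
    BlockRadixSort(temp_storage).Sort(thread_keys);

    // Write the sorted tile back to global memory
    for (int i = 0; i < 4; ++i)
        d_out[offset + i] = thread_keys[i];
}

Only the initial loads and final stores touch global memory; the sort itself stays on-chip, which is what the question title asks about.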

Sorting many small arrays in CUDA

Submitted by 前提是你 on 2019-12-13 14:40:58
Question: I am implementing a median filter in CUDA. For a particular pixel, I extract its neighbors within a window around that pixel, say an N x N (3 x 3) window, and so have an array of N x N elements. I do not envision using a window larger than 10 x 10 elements for my application. This array is local to the kernel and already loaded into device memory. From previous SO posts I have read, the most common sorting algorithms are implemented by Thrust. But Thrust can
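For a window this small, a per-thread insertion sort in registers is a common alternative to a library sort, since each thread owns its own tiny array. A minimal sketch, assuming a hypothetical 3 x 3 grayscale filter (the kernel name, parameters, and types are illustrative):

__global__ void median3x3(const unsigned char *d_in, unsigned char *d_out,
                          int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

    // Gather the 3 x 3 neighborhood into registers
    unsigned char w[9];
    int k = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            w[k++] = d_in[(y + dy) * width + (x + dx)];

    // Insertion sort: cheap for arrays of up to ~100 elements per thread
    for (int i = 1; i < 9; ++i) {
        unsigned char v = w[i];
        int j = i - 1;
        while (j >= 0 && w[j] > v) { w[j + 1] = w[j]; --j; }
        w[j + 1] = v;
    }

    d_out[y * width + x] = w[4];   // the middle element is the median
}

The same pattern scales to the 10 x 10 case by sizing w accordingly; no device-side library call is needed.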

Sorting (small) arrays by key in CUDA

Submitted by 左心房为你撑大大i on 2019-12-07 02:59:27
Question: I'm trying to write a function that takes a block of unsorted key/value pairs such as <7, 4> <2, 8> <3, 1> <2, 2> <1, 5> <7, 1> <3, 8> <7, 2> and sorts them by key while reducing the values of pairs with the same key: <1, 5> <2, 10> <3, 9> <7, 7>. Currently, I'm using a __device__ function like the one below, which is essentially a bitonic sort that combines values of the same key and sets the old data to an infinitely large value (just 99 for now) so that a subsequent bitonic sort
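The quoted function is cut off, so here is a minimal sketch of the plain block-wide bitonic sort-by-key that such a function builds on, assuming one thread per element and a power-of-two element count (the function and variable names are illustrative):

// Block-wide bitonic sort of key/value pairs held in shared memory.
// Assumes blockDim.x == n and n is a power of two.
__device__ void bitonicSortByKey(int *keys, int *vals, int n)
{
    unsigned int tid = threadIdx.x;
    for (int k = 2; k <= n; k <<= 1) {           // bitonic sequence size
        for (int j = k >> 1; j > 0; j >>= 1) {   // compare-exchange distance
            int ixj = tid ^ j;                   // partner index for this stage
            if (ixj > tid) {
                bool ascending = ((tid & k) == 0);
                if ((keys[tid] > keys[ixj]) == ascending) {
                    // Swap keys and values together
                    int tk = keys[tid]; keys[tid] = keys[ixj]; keys[ixj] = tk;
                    int tv = vals[tid]; vals[tid] = vals[ixj]; vals[ixj] = tv;
                }
            }
            __syncthreads();
        }
    }
}

After the sort, equal keys sit next to each other, so the combine step described in the question can fold a neighbor's value into the first pair of each run and overwrite the folded key with the sentinel before re-sorting.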

Block reduction in CUDA

Submitted by 岁酱吖の on 2019-12-05 01:18:33
Question: I am trying to do reduction in CUDA and I am really a newbie. I am currently studying a sample code from NVIDIA. I guess I am really not sure how to set up the block size and grid size, especially when my input array is larger (512 x 512) than a single block size. Here is the code.

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n) { sdata[tid] += g
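Since the question is specifically about grid and block sizing for a 512 x 512 input, here is a hedged launch sketch (the d_idata/d_odata names and the constants are illustrative; note that reduce6 as quoted reads g_idata[i + blockSize] without a bounds check, so n should be a multiple of 2*blockSize):

const unsigned int n = 512 * 512;
const unsigned int threads = 256;     // must match the blockSize template argument
const unsigned int blocks = 64;       // each block grid-strides over the input
size_t smem = threads * sizeof(int);  // backs the extern __shared__ sdata[]

reduce6<threads><<<blocks, threads, smem>>>(d_idata, d_odata, n);

// d_odata now holds one partial sum per block; finish the last 64 on the host
int partial[blocks];
cudaMemcpy(partial, d_odata, blocks * sizeof(int), cudaMemcpyDeviceToHost);
int total = 0;
for (unsigned int b = 0; b < blocks; ++b) total += partial[b];

Because of the grid-stride loop, the grid need not cover n / (2 * threads) exactly; 64 blocks is enough to keep a small GPU busy while leaving only a few partial sums to combine.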

How to use CUB and Thrust in one CUDA code

Submitted by 扶醉桌前 on 2019-12-01 13:37:39
Question: I'm trying to introduce some CUB into my "old" Thrust code, and so have started with a small example comparing thrust::reduce_by_key with cub::DeviceReduce::ReduceByKey, both applied to thrust::device_vectors. The Thrust part of the code is fine, but the CUB part, which naively uses raw pointers obtained via thrust::raw_pointer_cast, crashes after the CUB calls. I put in a cudaDeviceSynchronize() to try to solve this problem, but it didn't help. The CUB part of the code was cribbed from the CUB web pages. On OSX the runtime error is: libc++abi.dylib: terminate called throwing an exception
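For reference, here is a minimal sketch of the standard two-pass cub::DeviceReduce::ReduceByKey pattern over Thrust storage; the function and vector names are illustrative. The most common cause of crashes with this API is skipping the first call with d_temp_storage == NULL, which only queries the required temporary-storage size:

#include <cub/cub.cuh>
#include <thrust/device_vector.h>

void reduce_by_key_cub(thrust::device_vector<int> &keys,
                       thrust::device_vector<int> &vals,
                       thrust::device_vector<int> &unique_out,
                       thrust::device_vector<int> &aggregates_out,
                       thrust::device_vector<int> &num_runs_out)
{
    int n = keys.size();
    int *d_keys       = thrust::raw_pointer_cast(keys.data());
    int *d_vals       = thrust::raw_pointer_cast(vals.data());
    int *d_unique     = thrust::raw_pointer_cast(unique_out.data());
    int *d_aggregates = thrust::raw_pointer_cast(aggregates_out.data());
    int *d_num_runs   = thrust::raw_pointer_cast(num_runs_out.data());

    // Pass 1: d_temp_storage == NULL, so CUB only reports the bytes it needs
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
                                   d_keys, d_unique, d_vals, d_aggregates,
                                   d_num_runs, cub::Sum(), n);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Pass 2: the actual reduction
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
                                   d_keys, d_unique, d_vals, d_aggregates,
                                   d_num_runs, cub::Sum(), n);
    cudaFree(d_temp_storage);
}

The output vectors must be sized (e.g. to n) before the call, since CUB writes through raw pointers and cannot grow them.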

Why is my inclusive scan code 2x faster on CPU than on a GPU?

Submitted by 十年热恋 on 2019-11-28 14:43:20
Question: I wrote a short CUDA program that uses the highly optimized CUB library to demonstrate that one core of an old quad-core Intel Q6600 processor (all four supposedly capable of ~30 GFLOPS) can do an inclusive scan (or cumulative/prefix sum, if you prefer) on 100,000 elements faster than an Nvidia 750 Ti (supposedly capable of 1306 GFLOPS single precision). Why is this the case? The source code is:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cub/cub.cuh>
#include <stdio.h>
#include <time.h>
#include <algorithm>

#define gpuErrchk(ans) { gpuAssert((ans
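For reference, the standard cub::DeviceScan::InclusiveSum call pattern looks like the sketch below (function and parameter names are illustrative). For only 100,000 elements the scan kernel itself finishes in microseconds, so a naive end-to-end timing is usually dominated by PCIe transfers, the temp-storage allocation, and launch overhead rather than by the scan:

void inclusive_scan(const int *d_in, int *d_out, int num_items)
{
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;

    // Query pass: with d_temp_storage == NULL, CUB only reports the size
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Compute pass
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);
    cudaFree(d_temp_storage);
}

Timing only the second call (after a warm-up) gives a fairer picture of the GPU's scan throughput.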

cub BlockRadixSort: how to deal with large tile size or sort multiple tiles?

Submitted by ╄→尐↘猪︶ㄣ on 2019-11-28 06:24:17
Question: When using cub::BlockRadixSort to do the sorting within a block, if the number of elements is too large, how do we deal with that? If we set the tile size too large, the shared memory for the temporary storage will soon be unable to hold it. If we split the data into multiple tiles, how do we post-process each tile after sorting it? Answer 1: Caveat: I am not a cub expert (far from it). You might want to review this question/answer, as I'm building on some of the work I did there. Certainly if the
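One lever such an answer builds on is raising items per thread rather than threads per block, which grows the tile without growing the thread count. A minimal sketch, assuming a hypothetical 2048-element tile of 128 threads x 16 items (all names and sizes are illustrative):

#include <cub/cub.cuh>

__global__ void SortTileKernel(int *d_data)
{
    const int BLOCK_THREADS    = 128;
    const int ITEMS_PER_THREAD = 16;   // tile = 128 * 16 = 2048 keys

    typedef cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                           cub::BLOCK_LOAD_TRANSPOSE>  BlockLoad;
    typedef cub::BlockStore<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                            cub::BLOCK_STORE_TRANSPOSE> BlockStore;
    typedef cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSort;

    // The three primitives run one after another, so their shared-memory
    // needs can overlap in a union instead of adding up
    __shared__ union {
        typename BlockLoad::TempStorage      load;
        typename BlockStore::TempStorage     store;
        typename BlockRadixSort::TempStorage sort;
    } temp_storage;

    int items[ITEMS_PER_THREAD];
    int block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    BlockLoad(temp_storage.load).Load(d_data + block_offset, items);
    __syncthreads();
    BlockRadixSort(temp_storage.sort).Sort(items);
    __syncthreads();
    BlockStore(temp_storage.store).Store(d_data + block_offset, items);
}

If the data still does not fit in one tile, each block sorts its own tile this way, and a separate merge pass (or a device-wide sort such as cub::DeviceRadixSort) combines the sorted tiles.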