cub

Making CUB BlockRadixSort on-chip entirely?

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-18 07:20:43
Question: I am reading the CUB documentation and examples:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[4];
    ...
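The documentation example breaks off above. For context, here is a minimal, self-contained completion; it is a sketch under stated assumptions (the d_in/d_out parameters and the blocked load/store loops are illustrative additions, not part of the CUB example):

#include <cub/cub.cuh>

__global__ void ExampleKernel(int *d_in, int *d_out)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Each thread loads 4 consecutive keys (a "blocked" arrangement)
    int thread_keys[4];
    int offset = (blockIdx.x * 128 + threadIdx.x) * 4;
    for (int i = 0; i < 4; ++i)
        thread_keys[i] = d_in[offset + i];

    // Collectively sort the block's 512 keys; the sort itself touches only
    // registers and the shared-memory temp_storage, i.e. it runs on-chip
    BlockRadixSort(temp_storage).Sort(thread_keys);

    // Write the sorted tile back to global memory
    for (int i = 0; i < 4; ++i)
        d_out[offset + i] = thread_keys[i];
}

Only the initial loads and final stores touch global memory; the sort itself stays on-chip, which is what the question title asks about.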

Sorting many small arrays in CUDA

Submitted by 前提是你 on 2019-12-13 14:40:58
Question: I am implementing a median filter in CUDA. For a particular pixel, I extract its neighbors within a window around that pixel, say an N x N (3 x 3) window, and so have an array of N x N elements. I do not envision using a window larger than 10 x 10 elements for my application. This array is local to the kernel and already loaded into device memory. From previous SO posts I have read, the most common sorting algorithms are implemented by Thrust. But Thrust can
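For a window this small, a per-thread insertion sort in registers is a common alternative to a library sort, since each thread owns its own tiny array. A minimal sketch, assuming a hypothetical 3 x 3 grayscale filter (the kernel name, parameters, and types are illustrative):

__global__ void median3x3(const unsigned char *d_in, unsigned char *d_out,
                          int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

    // Gather the 3 x 3 neighborhood into registers
    unsigned char w[9];
    int k = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            w[k++] = d_in[(y + dy) * width + (x + dx)];

    // Insertion sort: cheap for arrays of up to ~100 elements per thread
    for (int i = 1; i < 9; ++i) {
        unsigned char v = w[i];
        int j = i - 1;
        while (j >= 0 && w[j] > v) { w[j + 1] = w[j]; --j; }
        w[j + 1] = v;
    }

    d_out[y * width + x] = w[4];   // the middle element is the median
}

The same pattern scales to the 10 x 10 case by sizing w accordingly; no device-side library call is needed.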

Sorting (small) arrays by key in CUDA

Submitted by 左心房为你撑大大i on 2019-12-07 02:59:27
Question: I'm trying to write a function that takes a block of unsorted key/value pairs such as <7, 4> <2, 8> <3, 1> <2, 2> <1, 5> <7, 1> <3, 8> <7, 2> and sorts them by key while reducing the values of pairs with the same key: <1, 5> <2, 10> <3, 9> <7, 7>. Currently, I'm using a __device__ function like the one below, which is essentially a bitonic sort that combines values of the same key and sets the old data to an infinitely large value (just 99 for now) so that a subsequent bitonic sort
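The quoted function is cut off, so here is a minimal sketch of the plain block-wide bitonic sort-by-key that such a function builds on, assuming one thread per element and a power-of-two element count (the function and variable names are illustrative):

// Block-wide bitonic sort of key/value pairs held in shared memory.
// Assumes blockDim.x == n and n is a power of two.
__device__ void bitonicSortByKey(int *keys, int *vals, int n)
{
    unsigned int tid = threadIdx.x;
    for (int k = 2; k <= n; k <<= 1) {           // bitonic sequence size
        for (int j = k >> 1; j > 0; j >>= 1) {   // compare-exchange distance
            int ixj = tid ^ j;                   // partner index for this stage
            if (ixj > tid) {
                bool ascending = ((tid & k) == 0);
                if ((keys[tid] > keys[ixj]) == ascending) {
                    // Swap keys and values together
                    int tk = keys[tid]; keys[tid] = keys[ixj]; keys[ixj] = tk;
                    int tv = vals[tid]; vals[tid] = vals[ixj]; vals[ixj] = tv;
                }
            }
            __syncthreads();
        }
    }
}

After the sort, equal keys sit next to each other, so the combine step described in the question can fold a neighbor's value into the first pair of each run and overwrite the folded key with the sentinel before re-sorting.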

Block reduction in CUDA

Submitted by 岁酱吖の on 2019-12-05 01:18:33
Question: I am trying to do reduction in CUDA and I am really a newbie. I am currently studying a sample code from NVIDIA. I guess I am really not sure how to set up the block size and grid size, especially when my input array is larger (512 x 512) than a single block size. Here is the code.

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n) { sdata[tid] += g
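Since the question is specifically about grid and block sizing for a 512 x 512 input, here is a hedged launch sketch (the d_idata/d_odata names and the constants are illustrative; note that reduce6 as quoted reads g_idata[i + blockSize] without a bounds check, so n should be a multiple of 2*blockSize):

const unsigned int n = 512 * 512;
const unsigned int threads = 256;     // must match the blockSize template argument
const unsigned int blocks = 64;       // each block grid-strides over the input
size_t smem = threads * sizeof(int);  // backs the extern __shared__ sdata[]

reduce6<threads><<<blocks, threads, smem>>>(d_idata, d_odata, n);

// d_odata now holds one partial sum per block; finish the last 64 on the host
int partial[blocks];
cudaMemcpy(partial, d_odata, blocks * sizeof(int), cudaMemcpyDeviceToHost);
int total = 0;
for (unsigned int b = 0; b < blocks; ++b) total += partial[b];

Because of the grid-stride loop, the grid need not cover n / (2 * threads) exactly; 64 blocks is enough to keep a small GPU busy while leaving only a few partial sums to combine.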

How to use CUB and Thrust in one CUDA code

Submitted by 扶醉桌前 on 2019-12-01 13:37:39
Question: I'm trying to introduce some CUB into my "old" Thrust code, and so have started with a small example comparing thrust::reduce_by_key with cub::DeviceReduce::ReduceByKey, both applied to thrust::device_vectors. The Thrust part of the code is fine, but the CUB part, which naively uses raw pointers obtained via thrust::raw_pointer_cast, crashes after the CUB calls. I put in a cudaDeviceSynchronize() to try to solve this problem, but it didn't help. The CUB part of the code was cribbed from the CUB web pages. On OSX the runtime error is: libc++abi.dylib: terminate called throwing an exception
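For reference, here is a minimal sketch of the standard two-pass cub::DeviceReduce::ReduceByKey pattern over Thrust storage; the function and vector names are illustrative. The most common cause of crashes with this API is skipping the first call with d_temp_storage == NULL, which only queries the required temporary-storage size:

#include <cub/cub.cuh>
#include <thrust/device_vector.h>

void reduce_by_key_cub(thrust::device_vector<int> &keys,
                       thrust::device_vector<int> &vals,
                       thrust::device_vector<int> &unique_out,
                       thrust::device_vector<int> &aggregates_out,
                       thrust::device_vector<int> &num_runs_out)
{
    int n = keys.size();
    int *d_keys       = thrust::raw_pointer_cast(keys.data());
    int *d_vals       = thrust::raw_pointer_cast(vals.data());
    int *d_unique     = thrust::raw_pointer_cast(unique_out.data());
    int *d_aggregates = thrust::raw_pointer_cast(aggregates_out.data());
    int *d_num_runs   = thrust::raw_pointer_cast(num_runs_out.data());

    // Pass 1: d_temp_storage == NULL, so CUB only reports the bytes it needs
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
                                   d_keys, d_unique, d_vals, d_aggregates,
                                   d_num_runs, cub::Sum(), n);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Pass 2: the actual reduction
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
                                   d_keys, d_unique, d_vals, d_aggregates,
                                   d_num_runs, cub::Sum(), n);
    cudaFree(d_temp_storage);
}

The output vectors must be sized (e.g. to n) before the call, since CUB writes through raw pointers and cannot grow them.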

Why is my inclusive scan code 2x faster on CPU than on a GPU?

Submitted by 十年热恋 on 2019-11-28 14:43:20
Question: I wrote a short CUDA program that uses the highly optimized CUB library to demonstrate that one core of an old quad-core Intel Q6600 processor (all four supposedly capable of ~30 GFLOPS) can do an inclusive scan (or cumulative/prefix sum, if you prefer) on 100,000 elements faster than an Nvidia 750 Ti (supposedly capable of 1306 GFLOPS single precision). Why is this the case? The source code is:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cub/cub.cuh>
#include <stdio.h>
#include <time.h>
#include <algorithm>

#define gpuErrchk(ans) { gpuAssert((ans
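For reference, the standard cub::DeviceScan::InclusiveSum call pattern looks like the sketch below (function and parameter names are illustrative). For only 100,000 elements the scan kernel itself finishes in microseconds, so a naive end-to-end timing is usually dominated by PCIe transfers, the temp-storage allocation, and launch overhead rather than by the scan:

void inclusive_scan(const int *d_in, int *d_out, int num_items)
{
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;

    // Query pass: with d_temp_storage == NULL, CUB only reports the size
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Compute pass
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);
    cudaFree(d_temp_storage);
}

Timing only the second call (after a warm-up) gives a fairer picture of the GPU's scan throughput.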

cub BlockRadixSort: how to deal with large tile size or sort multiple tiles?

Submitted by ╄→尐↘猪︶ㄣ on 2019-11-28 06:24:17
Question: When using cub::BlockRadixSort to do the sorting within a block, if the number of elements is too large, how do we deal with that? If we set the tile size too large, the shared memory for the temporary storage will soon be unable to hold it. If we split the data into multiple tiles, how do we post-process each tile after sorting it? Answer 1: Caveat: I am not a cub expert (far from it). You might want to review this question/answer, as I'm building on some of the work I did there. Certainly if the
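One lever such an answer builds on is raising items per thread rather than threads per block, which grows the tile without growing the thread count. A minimal sketch, assuming a hypothetical 2048-element tile of 128 threads x 16 items (all names and sizes are illustrative):

#include <cub/cub.cuh>

__global__ void SortTileKernel(int *d_data)
{
    const int BLOCK_THREADS    = 128;
    const int ITEMS_PER_THREAD = 16;   // tile = 128 * 16 = 2048 keys

    typedef cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                           cub::BLOCK_LOAD_TRANSPOSE>  BlockLoad;
    typedef cub::BlockStore<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                            cub::BLOCK_STORE_TRANSPOSE> BlockStore;
    typedef cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSort;

    // The three primitives run one after another, so their shared-memory
    // needs can overlap in a union instead of adding up
    __shared__ union {
        typename BlockLoad::TempStorage      load;
        typename BlockStore::TempStorage     store;
        typename BlockRadixSort::TempStorage sort;
    } temp_storage;

    int items[ITEMS_PER_THREAD];
    int block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    BlockLoad(temp_storage.load).Load(d_data + block_offset, items);
    __syncthreads();
    BlockRadixSort(temp_storage.sort).Sort(items);
    __syncthreads();
    BlockStore(temp_storage.store).Store(d_data + block_offset, items);
}

If the data still does not fit in one tile, each block sorts its own tile this way, and a separate merge pass (or a device-wide sort such as cub::DeviceRadixSort) combines the sorted tiles.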