gpgpu

Do warp vote functions synchronize threads in the warp?

点点圈 submitted on 2019-12-08 12:48:36
Question: Do CUDA warp vote functions, such as __any() and __all(), synchronize the threads in a warp? In other words, is there any guarantee that all threads inside the warp have executed the instructions preceding the warp vote function, especially the instruction(s) that compute the predicate? Answer 1: The synchronization is implicit, since threads within a warp execute in lockstep. [*] Code that relies on this behavior is known as "warp synchronous." [*] If you are thinking that conditional code will cause
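A minimal sketch of how the vote intrinsics are typically used. Note that on CUDA 9+ the `_sync` variants with an explicit member mask are the recommended form, since lockstep execution is no longer guaranteed on Volta and later; the kernel name, array names and sizes below are illustrative, not from the original question.

```cuda
#include <cstdio>

// Each thread evaluates its own predicate; the warp then votes on it.
// __all_sync/__any_sync converge the threads named in the mask before voting,
// so every voter is guaranteed to have computed its predicate first.
__global__ void vote_demo(const int *data, int n, int *warp_flags)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (tid < n) ? (data[tid] > 0) : 1;      // per-thread predicate

    unsigned mask = __activemask();                  // threads currently active in this warp
    int all_positive = __all_sync(mask, pred);       // 1 iff pred is non-zero for every voter
    int any_positive = __any_sync(mask, pred);       // 1 iff pred is non-zero for at least one voter

    if ((threadIdx.x & 31) == 0)                     // lane 0 records the warp's result
        warp_flags[tid / 32] = (all_positive << 1) | any_positive;
}
```

Launched as, e.g., vote_demo<<<blocks, 256>>>(d_data, n, d_flags) with one int of output per warp.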

3D Buffers in HLSL?

穿精又带淫゛_ submitted on 2019-12-08 12:26:52
Question: I want to send a series of integers to HLSL in the form of a 3D array using Unity. I've been trying to do this for a couple of days now, but without any gain. I tried to pack the buffers into each other ( StructuredBuffer<StructuredBuffer<StructuredBuffer<int>>> ), but it simply won't work. And I need to make this thing resizable, so I can't use fixed-size arrays in structs. What should I do? EDIT: To clarify a bit more what I am trying to do here, this is a medical program. When you go make a scan of
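StructuredBuffers cannot be nested, so the usual workaround is to flatten the 3D volume into a single 1D buffer and recompute the linear index from (x, y, z) and the volume dimensions. Below is a minimal sketch of that indexing arithmetic, written in C/CUDA syntax to match the other snippets on this page; the same expression works inside an HLSL compute shader over a StructuredBuffer<int> with the dimensions passed as shader constants (all names and the fill value are illustrative).

```cuda
// Flattened 3D indexing: the volume is one contiguous buffer of
// dimX * dimY * dimZ ints, and (x, y, z) maps to a single linear offset.
__host__ __device__ inline int flatten(int x, int y, int z, int dimX, int dimY)
{
    return (z * dimY + y) * dimX + x;   // x varies fastest
}

__global__ void fill_volume(int *volume, int dimX, int dimY, int dimZ)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < dimX && y < dimY && z < dimZ)
        volume[flatten(x, y, z, dimX, dimY)] = x + y + z;  // placeholder value
}
```

On the Unity side the same layout is a single ComputeBuffer of dimX*dimY*dimZ ints, which stays resizable because you only ever reallocate one flat buffer with new dimensions.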

OpenCL SDK overview and hardware interoperability

荒凉一梦 submitted on 2019-12-08 10:13:31
Question: I am a little bit confused about the overall situation when it comes to OpenCL development, so I'll just state my current understanding and questions as a list. Please correct me if I'm wrong. I know there are SDKs ("Platforms") by Intel and AMD (and I guess there is also OpenCL support in the Nvidia SDK?). Are there SDKs by other vendors? Will the SDK of one vendor support the devices of another, e.g. Nvidia devices with the AMD SDK? I am able to run programs on my Intel CPU using the AMD SDK. Is it the way
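Whichever SDK you build against, the OpenCL ICD loader exposes every vendor runtime installed on the machine as a separate platform at run time, so a quick way to see what is actually available is to enumerate platforms and devices. A minimal host-side sketch using only standard OpenCL 1.x API calls (error handling and the fixed-size platform array are simplifications):

```cpp
#include <CL/cl.h>
#include <cstdio>

int main()
{
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);      // how many platforms are installed?

    cl_platform_id platforms[16];
    clGetPlatformIDs(num_platforms, platforms, nullptr);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        char name[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, nullptr);

        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
        printf("Platform %u: %s (%u devices)\n", p, name, num_devices);
    }
    return 0;
}
```

On a typical machine you would see one platform per installed vendor runtime (e.g. Intel, AMD, NVIDIA), each listing only its own devices.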

Is there an efficient way to optimize my serialized code?

旧街凉风 submitted on 2019-12-08 09:23:49
Question: This question lacks details, so I decided to create another question instead of editing this one. The new question is here: Can i parallelize my code or it is not worth? I have a program running in CUDA, where one piece of the code runs within a loop (serialized, as you can see below). This piece of code is a search within an array that contains addresses and/or NULL pointers. All the threads execute the code below. while (i < n) { if (array[i] != NULL) { return array[i]; } i++; }
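One common way to replace such a serial scan is to let the threads check elements in parallel and use atomicMin to keep the smallest index whose entry is non-NULL, which reproduces the "first match" semantics of the loop. A hedged sketch (the kernel name and output convention are illustrative, not from the original code):

```cuda
#include <climits>

// Parallel "find first non-NULL entry": every thread strides over the array and
// atomicMin keeps the lowest qualifying index found by any thread.
// *first_idx must be initialized to INT_MAX (e.g. via cudaMemcpy) before launch.
__global__ void find_first(void *const *array, int n, int *first_idx)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
    {
        if (array[i] != NULL)
            atomicMin(first_idx, i);   // keep the smallest matching index
    }
}
// After the kernel finishes, *first_idx holds the first non-NULL slot
// (or INT_MAX if there was none), so array[*first_idx] is the sought pointer.
```

Whether this beats the serial loop depends on n and on how early the first match usually occurs, which is essentially what the follow-up question linked above asks.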

Best approach for convolution of multiple small matrices using CUDA

ぐ巨炮叔叔 submitted on 2019-12-08 06:58:49
Question: I need to perform multiple convolutions with small matrices and kernels, and I was hoping that utilizing the many processors of the GPU would enable me to do it as fast as possible. The problem is as follows: I have many matrices (~1,000 to ~10,000) of relatively small sizes (~15x15 down to 1x1 - as in scalar), and a certain number of convolution masks (~20 to 1). I need to convolve all the matrices with each convolution mask. Example: A; % 5,000 matrices of size 10x10, A(i) = a 10x10 matrix B; 10
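Because the individual matrices are tiny, a common batched approach is to assign one thread block per (matrix, mask) pair and one thread per output element, so the GPU's parallelism comes from the batch size rather than from any single image. A minimal sketch, under the assumption that all matrices share one size and all masks share one (odd) size, with zero padding at the borders; all names are illustrative:

```cuda
// Batched 2D filtering: blockIdx.x = matrix index, blockIdx.y = mask index,
// one thread per output pixel. Inputs are stored contiguously, matrix after matrix.
// (As written this is cross-correlation; flip the mask indices for a true convolution.)
__global__ void batched_conv(const float *mats, const float *masks, float *out,
                             int msize,   // matrices are msize x msize
                             int ksize)   // masks are ksize x ksize, ksize odd
{
    int m = blockIdx.x;                      // which matrix
    int k = blockIdx.y;                      // which mask
    int x = threadIdx.x, y = threadIdx.y;    // output pixel
    if (x >= msize || y >= msize) return;

    const float *mat  = mats  + m * msize * msize;
    const float *mask = masks + k * ksize * ksize;
    int r = ksize / 2;

    float acc = 0.0f;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int sx = x + dx, sy = y + dy;
            if (sx >= 0 && sx < msize && sy >= 0 && sy < msize)   // zero padding
                acc += mat[sy * msize + sx] * mask[(dy + r) * ksize + (dx + r)];
        }
    // output layout: [mask][matrix][row][col]
    out[((k * gridDim.x + m) * msize + y) * msize + x] = acc;
}
```

Launched as, e.g., batched_conv<<<dim3(numMats, numMasks), dim3(msize, msize)>>>(...), which works for matrices up to 32x32 per block; thousands of blocks is exactly the regime GPUs schedule well.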

Simulating pipeline program with CUDA

对着背影说爱祢 submitted on 2019-12-08 04:45:09
Question: Say I have two arrays A and B and a kernel1 that does some calculation on both arrays (vector addition, for example) by breaking the arrays into different chunks and writing the partial result to C . kernel1 then keeps doing this until all elements in the arrays are processed. unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; unsigned int gridSize = blockDim.x*gridDim.x; //iterate through each chunk of gridSize in both A and B while (i < N) { C[i] = A[i] + B[i]; i += gridSize; } Say,
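A producer/consumer pipeline between two kernels is usually built by splitting the arrays into host-managed chunks, launching the two kernels in separate streams, and using events so the consumer of chunk k waits only for the producer of chunk k rather than for the whole array. A hedged sketch of that scheduling, assuming kernel1/kernel2 take a chunk offset and length and that N divides evenly; the kernel bodies and names below are placeholders, not the original code:

```cuda
__global__ void kernel1(const float *A, const float *B, float *C, int offset, int len)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + len) C[i] = A[i] + B[i];          // producer stage
}

__global__ void kernel2(const float *C, float *D, int offset, int len)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + len) D[i] = 2.0f * C[i];          // consumer stage (placeholder op)
}

void run_pipeline(const float *A, const float *B, float *C, float *D, int N)
{
    const int NUM_CHUNKS = 8;
    const int CHUNK   = N / NUM_CHUNKS;                // assumes N % NUM_CHUNKS == 0
    const int THREADS = 256;
    const int BLOCKS  = (CHUNK + THREADS - 1) / THREADS;

    cudaStream_t produce, consume;
    cudaStreamCreate(&produce);
    cudaStreamCreate(&consume);

    cudaEvent_t done[NUM_CHUNKS];
    for (int k = 0; k < NUM_CHUNKS; ++k)
        cudaEventCreateWithFlags(&done[k], cudaEventDisableTiming);

    for (int k = 0; k < NUM_CHUNKS; ++k) {
        int offset = k * CHUNK;
        kernel1<<<BLOCKS, THREADS, 0, produce>>>(A, B, C, offset, CHUNK);
        cudaEventRecord(done[k], produce);             // chunk k of C is ready
        cudaStreamWaitEvent(consume, done[k], 0);      // consumer waits only for chunk k
        kernel2<<<BLOCKS, THREADS, 0, consume>>>(C, D, offset, CHUNK);
    }
    cudaDeviceSynchronize();

    for (int k = 0; k < NUM_CHUNKS; ++k) cudaEventDestroy(done[k]);
    cudaStreamDestroy(produce);
    cudaStreamDestroy(consume);
}
```

With this arrangement kernel2 on chunk k overlaps kernel1 on chunk k+1, which is the pipelining effect the question is after.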

Compile and build .cl file using NVIDIA's nvcc Compiler?

亡梦爱人 submitted on 2019-12-08 02:49:44
Question: Is it possible to compile a .cl file using NVIDIA's nvcc compiler? I am trying to set up Visual Studio 2010 to code OpenCL under the CUDA platform. But when I select CUDA C/C++ Compiler to compile and build the .cl file, it gives me errors like nvcc does not exist. What is the issue? Answer 1: You should be able to use nvcc to compile OpenCL codes. Normally, I would suggest using a filename extension of .c for a C-compliant code, and .cpp for a C++ compliant code(*), however nvcc has filename extension
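The point worth keeping in mind is that nvcc only has to build the host code; the OpenCL kernel itself is compiled at run time by the OpenCL driver via clBuildProgram, so the .cl contents never go through nvcc at all. A hedged illustration of that split (the file name, the assumed build line in the comment, and the kernel string are illustrative, not from the original answer):

```cpp
// host.cpp -- ordinary C++ that nvcc can compile like any host compiler would.
// Assumed build line (illustrative): nvcc -x c++ host.cpp -lOpenCL -o host
#include <CL/cl.h>
#include <cstdio>

static const char *kSource =   // what you would otherwise keep in a .cl file
    "__kernel void scale(__global float *x) { x[get_global_id(0)] *= 2.0f; }";

int main()
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, nullptr);
    cl_int err = clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);  // the driver compiles the kernel here
    printf("clBuildProgram returned %d\n", err);
    return 0;
}
```

If you keep the kernel in a separate .cl file, the host code just reads it into a string at run time; Visual Studio only needs to treat the host .c/.cpp file as a CUDA C/C++ (or plain C++) source.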

Does the official OpenCL 2.2 standard support the WaveFront?

妖精的绣舞 submitted on 2019-12-07 20:03:27
Question: As is known, AMD-OpenCL supports the WaveFront (August 2015): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf The AMD Radeon HD 7770 GPU, for example, supports more than 25,000 in-flight work-items and can switch to a new wavefront (containing up to 64 work-items) in a single cycle. But why is there no mention of the WaveFront in the OpenCL 1.0/2.0/2.2 standards? None of the PDFs contains the word WaveFront: https://www.khronos.org
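The specification keeps the hardware SIMD width vendor-neutral, but you can query it portably: the preferred work-group size multiple typically comes back as the wavefront size on AMD GPUs and as the warp size (32) on NVIDIA. A hedged host-side sketch, assuming you already have a built cl_kernel and its cl_device_id:

```cpp
#include <CL/cl.h>
#include <cstdio>

// Query the hardware scheduling granularity without any vendor extension.
// On AMD GPUs this usually reports the wavefront size; on NVIDIA, the warp size.
void print_simd_width(cl_kernel kernel, cl_device_id device)
{
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, nullptr);
    printf("Preferred work-group size multiple: %zu\n", multiple);
}
```

In the standard's own vocabulary the closest concept is the sub-group (cl_khr_subgroups / OpenCL 2.x), which is the vendor-neutral name for a warp or wavefront.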

boost::compute stream compaction

泪湿孤枕 submitted on 2019-12-07 18:54:52
Question: How can I do stream compaction with boost::compute? E.g. if you want to perform a heavy operation only on certain elements of an array. First you generate a mask array with ones corresponding to the elements for which you want to perform the operation: mask = [0 0 0 1 1 0 1 0 1] Then perform an exclusive scan (prefix sum) of the mask array to get: scan = [0 0 0 0 1 2 2 3 3] Then compact this array with: if (mask[i]) inds[scan[i]] = i; To get the final array of compacted indices (inds): [3 4 6 8] The size of the final
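To make the three steps concrete, here is the same mask / exclusive-scan / scatter pattern written in plain CUDA with Thrust for the scan, rather than boost::compute (boost::compute ships its own exclusive_scan, and a copy_if that performs the whole compaction in one call); the kernel and variable names are illustrative:

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Step 3 of the compaction: scatter each selected index i to position scan[i].
__global__ void scatter_indices(const int *mask, const int *scan, int *inds, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && mask[i])
        inds[scan[i]] = i;
}

int main()
{
    const int n = 9;
    int h_mask[n] = {0, 0, 0, 1, 1, 0, 1, 0, 1};
    thrust::device_vector<int> mask(h_mask, h_mask + n);
    thrust::device_vector<int> scan(n), inds(n);

    // Step 2: exclusive prefix sum of the mask -> [0 0 0 0 1 2 2 3 3].
    thrust::exclusive_scan(mask.begin(), mask.end(), scan.begin());

    // Step 3: scatter. The number of selected elements is scan[n-1] + mask[n-1].
    scatter_indices<<<1, 256>>>(thrust::raw_pointer_cast(mask.data()),
                                thrust::raw_pointer_cast(scan.data()),
                                thrust::raw_pointer_cast(inds.data()), n);
    cudaDeviceSynchronize();   // inds now begins with {3, 4, 6, 8}
    return 0;
}
```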

OpenGL 4.0 GPU Draw Feature?

 ̄綄美尐妖づ submitted on 2019-12-07 18:21:57
Question: In Wikipedia's and other sources' descriptions of OpenGL 4.0 I read about this feature: Drawing of data generated by OpenGL or external APIs such as OpenCL, without CPU intervention. What is this referring to? Edit: It seems like this must be referring to Draw_Indirect, which I believe somehow extends the draw phase to include feedback from shader programs or programs from interop (OpenCL/CUDA, basically). It looks as if there are a few caveats and tricks to getting the calls to keep staying on the
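The feature in question is indirect drawing (GL_ARB_draw_indirect, core in OpenGL 4.0): the draw parameters live in a GPU buffer that a compute kernel or an interop API such as OpenCL can write, so the CPU never reads them back before issuing the draw. A minimal sketch of the GL side (buffer and function names are illustrative; in 4.0 the last field of the command struct is reserved and must be zero, and only became baseInstance in 4.2):

```cpp
#include <GL/glcorearb.h>   // or your loader's header (GLEW/GLAD); a current GL context is assumed

// The draw parameters themselves live in a GPU buffer; OpenCL or a compute
// shader can overwrite this buffer between frames without any CPU round trip.
typedef struct {
    GLuint count;           // vertices to draw
    GLuint primCount;       // instances
    GLuint first;           // first vertex
    GLuint reserved;        // must be 0 in GL 4.0 (baseInstance from 4.2 on)
} DrawArraysIndirectCommand;

void setup_and_draw(GLuint vao)
{
    DrawArraysIndirectCommand cmd = { 36, 1, 0, 0 };

    GLuint indirectBuf;
    glGenBuffers(1, &indirectBuf);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuf);
    glBufferData(GL_DRAW_INDIRECT_BUFFER, sizeof(cmd), &cmd, GL_DYNAMIC_DRAW);

    glBindVertexArray(vao);
    // Reads count/primCount/first from the bound GL_DRAW_INDIRECT_BUFFER at offset 0.
    glDrawArraysIndirect(GL_TRIANGLES, (const void *)0);
}
```

Once the indirect buffer is filled on the GPU (e.g. by an OpenCL kernel via GL interop), the only CPU involvement left is issuing the glDrawArraysIndirect call itself.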