gpgpu

Reading GPU resource data by CPU

Submitted by 偶尔善良 on 2019-12-11 07:49:35
Question: I am learning DirectX 11 these days and have been stuck on the compute shader section. I made four resources and three corresponding views: an immutable input buffer = {1,1,1,1,1} / SRV, an immutable input buffer = {2,2,2,2,2} / SRV, an output buffer / UAV, and a staging buffer for reading / no view. I succeeded in creating everything, dispatching the CS function, and copying data from the output buffer to the staging buffer, and I read/checked the data. // INPUT BUFFER1-------------------------------------------------- const int
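Not the asker's actual code (the snippet above is truncated), but a minimal sketch of the usual D3D11 readback path, assuming context is the immediate ID3D11DeviceContext, outputBuf is the UAV-backed output buffer, and stagingBuf was created with D3D11_USAGE_STAGING and D3D11_CPU_ACCESS_READ:

```
// Hypothetical fragment: copy the compute-shader output into the staging buffer,
// then map the staging buffer for CPU reads.
context->CopyResource(stagingBuf, outputBuf);

D3D11_MAPPED_SUBRESOURCE mapped = {};
if (SUCCEEDED(context->Map(stagingBuf, 0, D3D11_MAP_READ, 0, &mapped)))
{
    const int* data = static_cast<const int*>(mapped.pData);
    // ... inspect data[0], data[1], ... here ...
    context->Unmap(stagingBuf, 0);
}
```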

Multiple GPUs in CUDA 3.2 and issues with Cuda 4.0

Submitted by 给你一囗甜甜゛ on 2019-12-11 07:43:06
Question: I am new to multiple GPUs. I have written code for a single GPU and want to speed it up further by using multiple GPUs. I am working with two GTX 470 cards with MS VS 2008 and CUDA toolkit 4.0, and I am facing two problems. The first problem is that my code somehow doesn't run correctly with the 4.0 build rules but works fine with the 3.2 build rules. Also, the SDK multiGPU example doesn't build on VS2008, giving the error C3861: 'cudaDeviceReset': identifier not found. My second problem is, if I have to work with 3.2 then
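As a point of reference, cudaDeviceReset and cudaDeviceSynchronize only exist from CUDA 4.0 onward; the 3.2 toolkit spelled them cudaThreadExit and cudaThreadSynchronize, so the C3861 error is consistent with the project still picking up 3.2 headers or build rules. A minimal sketch (hypothetical kernel and sizes) of the 4.0-style pattern where one host thread drives each device in turn:

```
// Minimal multi-GPU sketch against the CUDA 4.x runtime: one host thread selects each
// device via cudaSetDevice (under 3.2 you would need one host thread per GPU instead).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleKernel(float* data, int n)    // placeholder for the real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    const int n = 1 << 20;

    for (int dev = 0; dev < deviceCount; ++dev)
    {
        cudaSetDevice(dev);                        // select one of the two GTX 470s
        float* d_data = 0;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaDeviceSynchronize();                   // CUDA 4.0 name; 3.2 used cudaThreadSynchronize
        cudaFree(d_data);
        cudaDeviceReset();                         // CUDA 4.0 name; 3.2 used cudaThreadExit
    }
    printf("ran on %d device(s)\n", deviceCount);
    return 0;
}
```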

What are the requirements for using `shfl` operations on AMD GPU using HIP C++?

Submitted by 不羁的心 on 2019-12-11 07:23:55
Question: There is AMD HIP C++, which is very similar to CUDA C++. AMD also created Hipify to convert CUDA C++ to HIP C++ (portable C++ code), which can be executed on both nVidia and AMD GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP There are requirements for using shfl operations on nVidia GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/tree/master/samples/2_Cookbook/4_shfl#requirement-for-nvidia Requirement for nvidia: please make sure you have a 3.0 or higher compute
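The linked requirement (compute capability 3.0 or higher on the NVIDIA side) is what gates the shuffle instructions themselves; HIP's __shfl_down variant takes no mask argument and operates on 64-wide wavefronts on AMD hardware. A small self-contained CUDA sketch (not taken from the linked cookbook sample) of a warp sum reduction built on shuffles:

```
// Warp-level sum reduction using shuffle. Requires compute capability >= 3.0 on NVIDIA
// hardware; the *_sync spelling needs CUDA 9+, older toolkits used __shfl_down.
#include <cstdio>

__global__ void warpReduce(const int* in, int* out)
{
    int v = in[threadIdx.x];                            // one warp: threadIdx.x in [0, 31]
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);   // pull value from lane (lane + offset)
    if (threadIdx.x == 0) *out = v;                     // lane 0 ends up holding the warp's sum
}

int main()
{
    int h_in[32], h_out = 0;
    int *d_in = 0, *d_out = 0;
    for (int i = 0; i < 32; ++i) h_in[i] = 1;
    cudaMalloc((void**)&d_in, 32 * sizeof(int));
    cudaMalloc((void**)&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, 32 * sizeof(int), cudaMemcpyHostToDevice);
    warpReduce<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_out);                              // expected: 32
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```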

Data corruption when replacing uniform array with 1d texture in WebGL

Submitted by 若如初见. on 2019-12-11 06:34:29
Question: I am doing some GPGPU processing on a large 4D input array in WebGL2. Initially, I just flattened the input array and passed it in as a uniform array of ints, with a custom accessor function in GLSL to translate 4D coordinates into an array index, as follows: const int SIZE = 5; // The largest dimension that works; if I can switch to textures, this will be passed in as a uniform value. const int SIZE2 = SIZE*SIZE; const int SIZE3 = SIZE*SIZE2; const int SIZE4 = SIZE*SIZE3; uniform int u_map
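For reference, the flattening arithmetic the accessor performs does not change when the data moves into a texture; only the final lookup does. A host-side C++ sketch of that same flattening, plus one assumed way of mapping the flat index onto a 2D texel grid (TEX_WIDTH and the texel layout are illustrative assumptions, not taken from the question):

```
// Mirrors the shader constants; TEX_WIDTH and the texel mapping are assumptions.
#include <cassert>

const int SIZE  = 5;
const int SIZE2 = SIZE * SIZE;
const int SIZE3 = SIZE * SIZE2;
const int SIZE4 = SIZE * SIZE3;
const int TEX_WIDTH = 64;                        // hypothetical texture width

int flatten(int x, int y, int z, int w)          // 4D coordinates -> flat index into u_map
{
    return x + SIZE * y + SIZE2 * z + SIZE3 * w;
}

void texel(int idx, int& tx, int& ty)            // flat index -> (column, row) in the texture
{
    tx = idx % TEX_WIDTH;
    ty = idx / TEX_WIDTH;
}

int main()
{
    assert(flatten(SIZE - 1, SIZE - 1, SIZE - 1, SIZE - 1) == SIZE4 - 1);
    int tx, ty;
    texel(SIZE4 - 1, tx, ty);
    assert(ty * TEX_WIDTH + tx == SIZE4 - 1);
    return 0;
}
```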

Reduction of matrix rows in OpenCL

Submitted by 半腔热情 on 2019-12-11 06:28:55
Question: I have a matrix stored as a 1D array on the GPU, and I'm trying to write an OpenCL kernel that will perform a reduction over every row of this matrix. For example, if my matrix is 2x3 with the elements [1, 2, 3, 4, 5, 6], what I want is: [1, 2, 3] = [6], [4, 5, 6] = [15]. Obviously, since I'm talking about reduction, the actual result could have more than one element per row: [1, 2, 3] = [3, 3], [4, 5, 6] = [9, 6]. The final calculation I can then do in another kernel or on the CPU. Well,
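The usual pattern here is one work-group per row with a tree reduction in local memory. A sketch of that pattern written as a CUDA kernel (blocks play the role of work-groups, __shared__ the role of __local; the block size of 256 is an assumption and must be a power of two for this loop):

```
// One block reduces one row of a row-major matrix; placeholder names throughout.
__global__ void rowReduce(const float* mat, float* rowSums, int cols)
{
    extern __shared__ float sdata[];
    const float* row = mat + blockIdx.x * cols;   // each block handles one row

    // Strided load + partial sums so any number of columns works with a fixed block size.
    float sum = 0.0f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        sum += row[c];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared (local) memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) rowSums[blockIdx.x] = sdata[0];
}

// Launch sketch: rowReduce<<<numRows, 256, 256 * sizeof(float)>>>(d_mat, d_rowSums, cols);
```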

Problems outputting gl_PrimitiveID to custom frame buffer object (FBO)

Submitted by 我们两清 on 2019-12-11 06:23:47
Question: I have a very basic fragment shader from which I want to output gl_PrimitiveID to a framebuffer object (FBO) which I have defined. Below is my fragment shader:

#version 150
uniform vec4 colorConst;
out vec4 fragColor;
out uvec4 triID;
void main(void)
{
    fragColor = colorConst;
    triID.r = uint(gl_PrimitiveID);
}

I set up my FBO like this:

GLuint renderbufId0;
GLuint renderbufId1;
GLuint depthbufId;
GLuint framebufId;
// generate render and frame buffer objects
glGenRenderbuffers( 1,
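The asker's setup is cut off above, so the following is only a sketch of one typical way to finish it: both fragment outputs need a color attachment of a matching format, the draw buffers must be enabled, and the outputs must be bound to color numbers before the program is linked (width, height and programId are placeholders; a current OpenGL 3.2+ context is assumed):

```
// Hypothetical completion of the FBO setup; not the asker's actual code.
GLenum drawBufs[2] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };

glBindRenderbuffer(GL_RENDERBUFFER, renderbufId0);
glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, width, height);      // fragColor
glBindRenderbuffer(GL_RENDERBUFFER, renderbufId1);
glRenderbufferStorage(GL_RENDERBUFFER, GL_R32UI, width, height);      // triID (unsigned int)
glBindRenderbuffer(GL_RENDERBUFFER, depthbufId);
glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT24, width, height);

glBindFramebuffer(GL_FRAMEBUFFER, framebufId);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, renderbufId0);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1, GL_RENDERBUFFER, renderbufId1);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,  GL_RENDERBUFFER, depthbufId);
glDrawBuffers(2, drawBufs);

// The fragment outputs also need to be bound to those color numbers before linking:
glBindFragDataLocation(programId, 0, "fragColor");
glBindFragDataLocation(programId, 1, "triID");
```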

How to make a strided chunk iterator with Thrust/CUDA

Submitted by 折月煮酒 on 2019-12-11 06:12:35
Question: I need an iterator class like this one https://github.com/thrust/thrust/blob/master/examples/strided_range.cu but where the new iterator produces the sequence [k * size_stride, k * size_stride + 1, ..., k * size_stride + size_chunk - 1, ...] with k = 0, 1, ..., N. Example: size_stride = 8, size_chunk = 3, N = 3; then the sequence is [0,1,2,8,9,10,16,17,18,24,25,26]. I don't know how to do this efficiently... Answer 1: The strided range iterator is basically a carefully crafted permutation iterator with a functor that
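A sketch of that idea applied to the chunked sequence: feed a counting iterator through the functor i -> (i / size_chunk) * size_stride + (i % size_chunk) and use the result as the map of a permutation iterator (the functor name and the demo vector are made up for illustration):

```
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <iostream>

struct chunk_stride_map
{
    int size_stride, size_chunk;
    chunk_stride_map(int s, int c) : size_stride(s), size_chunk(c) {}
    __host__ __device__ int operator()(int i) const
    {
        // i-th element of [0,1,2, 8,9,10, 16,17,18, ...] for stride 8, chunk 3
        return (i / size_chunk) * size_stride + (i % size_chunk);
    }
};

int main()
{
    const int size_stride = 8, size_chunk = 3, N = 3;
    const int count = (N + 1) * size_chunk;          // 12 selected elements, k = 0..N

    thrust::device_vector<int> data(32);
    thrust::sequence(data.begin(), data.end());      // data[i] = i, so the output shows the picked indices

    auto map   = thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                                 chunk_stride_map(size_stride, size_chunk));
    auto begin = thrust::make_permutation_iterator(data.begin(), map);

    thrust::device_vector<int> out(count);
    thrust::copy(begin, begin + count, out.begin());

    thrust::host_vector<int> h_out = out;            // device -> host transfer for printing
    for (int i = 0; i < count; ++i) std::cout << h_out[i] << " ";   // 0 1 2 8 9 10 16 17 18 24 25 26
    std::cout << std::endl;
    return 0;
}
```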

GPGPU: Consequence of having a common PC in a warp

Submitted by 南笙酒味 on 2019-12-11 05:49:21
Question: I read in a book that in a wavefront or warp, all threads share a common program counter. What is the consequence of this? Why does it matter? Answer 1: NVIDIA GPUs execute 32 threads at a time (warps) and AMD GPUs execute 64 threads at a time (wavefronts). Sharing the control logic, fetch, and data paths reduces area and increases perf/area and perf/watt. In order to take advantage of this design, programming languages and developers need to understand how to coalesce memory accesses and how to
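A tiny CUDA sketch of the main consequence: with one program counter per warp, the two sides of a divergent branch execute one after the other with the non-taken lanes masked off, rather than concurrently (illustrative kernel, not part of the quoted answer):

```
__global__ void divergent(int* out)
{
    int lane = threadIdx.x % 32;          // lane index within the warp
    if (lane < 16)
        out[threadIdx.x] = 1;             // pass 1: lanes 0-15 execute, 16-31 are masked off
    else
        out[threadIdx.x] = 2;             // pass 2: lanes 16-31 execute, 0-15 are masked off
    // the whole warp is only back in lockstep after the branch reconverges
}
```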

Is computing integral image on GPU really faster than on CPU?

Submitted by 眉间皱痕 on 2019-12-11 05:47:22
Question: I'm new to GPU computing, so this may be a really naive question. I did a few look-ups, and it seems computing an integral image on the GPU is a pretty good idea. However, when I really dig into it, I wonder whether it is actually faster than the CPU, especially for big images. So I just want to know your thoughts on it, and some explanation of whether the GPU is really faster. So, assuming we have an MxN image, computing the integral image on the CPU would need roughly 3xMxN additions, which is O(MxN). On the GPU, follow the code
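To make the operation count concrete: the single-pass CPU recurrence ii(r,c) = img(r,c) + ii(r-1,c) + ii(r,c-1) - ii(r-1,c-1) costs about three additions per pixel, which is where the 3xMxN figure comes from. A small C++ reference sketch (row-major layout assumed, names are placeholders):

```
#include <vector>

// Integral image of a row-major M x N image; ~3 additions per pixel.
std::vector<long long> integralImage(const std::vector<int>& img, int M, int N)
{
    std::vector<long long> ii(static_cast<size_t>(M) * N, 0);
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c)
        {
            long long up     = (r > 0)          ? ii[(r - 1) * N + c]       : 0;
            long long left   = (c > 0)          ? ii[r * N + (c - 1)]       : 0;
            long long upleft = (r > 0 && c > 0) ? ii[(r - 1) * N + (c - 1)] : 0;
            ii[r * N + c] = img[r * N + c] + up + left - upleft;
        }
    return ii;
}
```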

How is variable in device memory used by external function?

Submitted by 荒凉一梦 on 2019-12-11 05:38:45
Question: In this code:

#include <iostream>

void intfun(int * variable, int value){
  #pragma acc parallel present(variable[:1]) num_gangs(1) num_workers(1)
  {
    *variable = value;
  }
}

int main(){
  int var, value = 29;
  #pragma acc enter data create(var) copyin(value)
  intfun(&var,value);
  #pragma acc exit data copyout(var) delete(value)
  std::cout << var << std::endl;
}

How is int value recognized to be on device memory in intfun? If I replace present(variable[:1]) with present(variable[:1],value) in the intfun
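The behaviour being asked about hinges on value being a by-value scalar argument: in an acc parallel region, scalars not named in a data clause are treated as firstprivate by default, so their current value travels with the launch and needs no present() entry. A minimal sketch of that default (placeholder names, not the asker's program):

```
#include <iostream>

void scale(int* v, int factor)
{
    // 'v' must already be present on the device; 'factor' is a by-value scalar and is
    // implicitly firstprivate, i.e. its value is shipped with the kernel launch.
    #pragma acc parallel present(v[:1]) num_gangs(1) num_workers(1)
    {
        *v = *v * factor;
    }
}

int main()
{
    int x = 3;
    #pragma acc enter data copyin(x)
    scale(&x, 7);
    #pragma acc exit data copyout(x)
    std::cout << x << std::endl;   // 21 when the region runs on the device
    return 0;
}
```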