gpgpu

Reading GPU resource data by CPU

Submitted by 偶尔善良 on 2019-12-11 07:49:35
Question: I am learning DirectX 11 these days and have been stuck on the compute shader section. I made four resources and three corresponding views: an immutable input buffer = {1,1,1,1,1} / SRV, an immutable input buffer = {2,2,2,2,2} / SRV, an output buffer / UAV, and a staging buffer for reading / no view. I succeeded in creating everything, dispatching the CS function, and copying data from the output buffer to the staging buffer, and I read/checked the data. // INPUT BUFFER1-------------------------------------------------- const int
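Not the asker's actual code (the snippet above is truncated), but a minimal sketch of the usual D3D11 readback path, assuming context is the immediate ID3D11DeviceContext, outputBuf is the UAV-backed output buffer, and stagingBuf was created with D3D11_USAGE_STAGING and D3D11_CPU_ACCESS_READ:

```
// Hypothetical fragment: copy the compute-shader output into the staging buffer,
// then map the staging buffer for CPU reads.
context->CopyResource(stagingBuf, outputBuf);

D3D11_MAPPED_SUBRESOURCE mapped = {};
if (SUCCEEDED(context->Map(stagingBuf, 0, D3D11_MAP_READ, 0, &mapped)))
{
    const int* data = static_cast<const int*>(mapped.pData);
    // ... inspect data[0], data[1], ... here ...
    context->Unmap(stagingBuf, 0);
}
```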

Multiple GPUs in CUDA 3.2 and issues with Cuda 4.0

Submitted by 给你一囗甜甜゛ on 2019-12-11 07:43:06
Question: I am new to multiple GPUs. I have written code for a single GPU and want to speed it up further by using multiple GPUs. I am working with two GTX 470 cards with MS VS 2008 and CUDA toolkit 4.0, and I am facing two problems. The first problem is that my code somehow doesn't run correctly with the 4.0 build rules but works fine with the 3.2 build rules. Also, the SDK multiGPU example doesn't build on VS2008, giving the error C3861: 'cudaDeviceReset': identifier not found. My second problem is, if I have to work with 3.2 then
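As a point of reference, cudaDeviceReset and cudaDeviceSynchronize only exist from CUDA 4.0 onward; the 3.2 toolkit spelled them cudaThreadExit and cudaThreadSynchronize, so the C3861 error is consistent with the project still picking up 3.2 headers or build rules. A minimal sketch (hypothetical kernel and sizes) of the 4.0-style pattern where one host thread drives each device in turn:

```
// Minimal multi-GPU sketch against the CUDA 4.x runtime: one host thread selects each
// device via cudaSetDevice (under 3.2 you would need one host thread per GPU instead).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleKernel(float* data, int n)    // placeholder for the real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    const int n = 1 << 20;

    for (int dev = 0; dev < deviceCount; ++dev)
    {
        cudaSetDevice(dev);                        // select one of the two GTX 470s
        float* d_data = 0;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaDeviceSynchronize();                   // CUDA 4.0 name; 3.2 used cudaThreadSynchronize
        cudaFree(d_data);
        cudaDeviceReset();                         // CUDA 4.0 name; 3.2 used cudaThreadExit
    }
    printf("ran on %d device(s)\n", deviceCount);
    return 0;
}
```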

What are the requirements for using `shfl` operations on AMD GPU using HIP C++?

Submitted by 不羁的心 on 2019-12-11 07:23:55
Question: There is AMD HIP C++, which is very similar to CUDA C++. AMD also created Hipify to convert CUDA C++ to HIP C++ (portable C++ code), which can be executed on both nVidia and AMD GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP There are requirements for using shfl operations on nVidia GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/tree/master/samples/2_Cookbook/4_shfl#requirement-for-nvidia Requirement for nvidia: please make sure you have a 3.0 or higher compute
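The linked requirement (compute capability 3.0 or higher on the NVIDIA side) is what gates the shuffle instructions themselves; HIP's __shfl_down variant takes no mask argument and operates on 64-wide wavefronts on AMD hardware. A small self-contained CUDA sketch (not taken from the linked cookbook sample) of a warp sum reduction built on shuffles:

```
// Warp-level sum reduction using shuffle. Requires compute capability >= 3.0 on NVIDIA
// hardware; the *_sync spelling needs CUDA 9+, older toolkits used __shfl_down.
#include <cstdio>

__global__ void warpReduce(const int* in, int* out)
{
    int v = in[threadIdx.x];                            // one warp: threadIdx.x in [0, 31]
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);   // pull value from lane (lane + offset)
    if (threadIdx.x == 0) *out = v;                     // lane 0 ends up holding the warp's sum
}

int main()
{
    int h_in[32], h_out = 0;
    int *d_in = 0, *d_out = 0;
    for (int i = 0; i < 32; ++i) h_in[i] = 1;
    cudaMalloc((void**)&d_in, 32 * sizeof(int));
    cudaMalloc((void**)&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, 32 * sizeof(int), cudaMemcpyHostToDevice);
    warpReduce<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_out);                              // expected: 32
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```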

Data corruption when replacing uniform array with 1d texture in WebGL

Submitted by 若如初见. on 2019-12-11 06:34:29
Question: I am doing some GPGPU processing on a large 4D input array in WebGL2. Initially, I just flattened the input array and passed it in as a uniform array of ints, with a custom accessor function in GLSL to translate 4D coordinates into an array index, as follows: const int SIZE = 5; // The largest dimension that works; if I can switch to textures, this will be passed in as a uniform value. const int SIZE2 = SIZE*SIZE; const int SIZE3 = SIZE*SIZE2; const int SIZE4 = SIZE*SIZE3; uniform int u_map
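For reference, the flattening arithmetic the accessor performs does not change when the data moves into a texture; only the final lookup does. A host-side C++ sketch of that same flattening, plus one assumed way of mapping the flat index onto a 2D texel grid (TEX_WIDTH and the texel layout are illustrative assumptions, not taken from the question):

```
// Mirrors the shader constants; TEX_WIDTH and the texel mapping are assumptions.
#include <cassert>

const int SIZE  = 5;
const int SIZE2 = SIZE * SIZE;
const int SIZE3 = SIZE * SIZE2;
const int SIZE4 = SIZE * SIZE3;
const int TEX_WIDTH = 64;                        // hypothetical texture width

int flatten(int x, int y, int z, int w)          // 4D coordinates -> flat index into u_map
{
    return x + SIZE * y + SIZE2 * z + SIZE3 * w;
}

void texel(int idx, int& tx, int& ty)            // flat index -> (column, row) in the texture
{
    tx = idx % TEX_WIDTH;
    ty = idx / TEX_WIDTH;
}

int main()
{
    assert(flatten(SIZE - 1, SIZE - 1, SIZE - 1, SIZE - 1) == SIZE4 - 1);
    int tx, ty;
    texel(SIZE4 - 1, tx, ty);
    assert(ty * TEX_WIDTH + tx == SIZE4 - 1);
    return 0;
}
```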

Reduction of matrix rows in OpenCL

Submitted by 半腔热情 on 2019-12-11 06:28:55
Question: I have a matrix stored as a 1D array on the GPU, and I'm trying to write an OpenCL kernel that will perform a reduction over every row of this matrix. For example, if my matrix is 2x3 with the elements [1, 2, 3, 4, 5, 6], what I want is: [1, 2, 3] = [6], [4, 5, 6] = [15]. Obviously, since I'm talking about reduction, the actual result could have more than one element per row: [1, 2, 3] = [3, 3], [4, 5, 6] = [9, 6]. The final calculation I can then do in another kernel or on the CPU. Well,
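The usual pattern here is one work-group per row with a tree reduction in local memory. A sketch of that pattern written as a CUDA kernel (blocks play the role of work-groups, __shared__ the role of __local; the block size of 256 is an assumption and must be a power of two for this loop):

```
// One block reduces one row of a row-major matrix; placeholder names throughout.
__global__ void rowReduce(const float* mat, float* rowSums, int cols)
{
    extern __shared__ float sdata[];
    const float* row = mat + blockIdx.x * cols;   // each block handles one row

    // Strided load + partial sums so any number of columns works with a fixed block size.
    float sum = 0.0f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        sum += row[c];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared (local) memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) rowSums[blockIdx.x] = sdata[0];
}

// Launch sketch: rowReduce<<<numRows, 256, 256 * sizeof(float)>>>(d_mat, d_rowSums, cols);
```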

Problems outputting gl_PrimitiveID to custom frame buffer object (FBO)

Submitted by 我们两清 on 2019-12-11 06:23:47
Question: I have a very basic fragment shader from which I want to output gl_PrimitiveID to a framebuffer object (FBO) which I have defined. Below is my fragment shader:

#version 150
uniform vec4 colorConst;
out vec4 fragColor;
out uvec4 triID;
void main(void)
{
    fragColor = colorConst;
    triID.r = uint(gl_PrimitiveID);
}

I set up my FBO like this:

GLuint renderbufId0;
GLuint renderbufId1;
GLuint depthbufId;
GLuint framebufId;
// generate render and frame buffer objects
glGenRenderbuffers( 1,
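The asker's setup is cut off above, so the following is only a sketch of one typical way to finish it: both fragment outputs need a color attachment of a matching format, the draw buffers must be enabled, and the outputs must be bound to color numbers before the program is linked (width, height and programId are placeholders; a current OpenGL 3.2+ context is assumed):

```
// Hypothetical completion of the FBO setup; not the asker's actual code.
GLenum drawBufs[2] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };

glBindRenderbuffer(GL_RENDERBUFFER, renderbufId0);
glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, width, height);      // fragColor
glBindRenderbuffer(GL_RENDERBUFFER, renderbufId1);
glRenderbufferStorage(GL_RENDERBUFFER, GL_R32UI, width, height);      // triID (unsigned int)
glBindRenderbuffer(GL_RENDERBUFFER, depthbufId);
glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT24, width, height);

glBindFramebuffer(GL_FRAMEBUFFER, framebufId);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, renderbufId0);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1, GL_RENDERBUFFER, renderbufId1);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,  GL_RENDERBUFFER, depthbufId);
glDrawBuffers(2, drawBufs);

// The fragment outputs also need to be bound to those color numbers before linking:
glBindFragDataLocation(programId, 0, "fragColor");
glBindFragDataLocation(programId, 1, "triID");
```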

How to make a strided chunk iterator with Thrust/CUDA

Submitted by 折月煮酒 on 2019-12-11 06:12:35
Question: I need an iterator class like this one https://github.com/thrust/thrust/blob/master/examples/strided_range.cu but where the new iterator produces the sequence [k * size_stride, k * size_stride + 1, ..., k * size_stride + size_chunk - 1, ...] with k = 0, 1, ..., N. Example: size_stride = 8, size_chunk = 3, N = 3; then the sequence is [0,1,2,8,9,10,16,17,18,24,25,26]. I don't know how to do this efficiently... Answer 1: The strided range iterator is basically a carefully crafted permutation iterator with a functor that
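A sketch of that idea applied to the chunked sequence: feed a counting iterator through the functor i -> (i / size_chunk) * size_stride + (i % size_chunk) and use the result as the map of a permutation iterator (the functor name and the demo vector are made up for illustration):

```
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <iostream>

struct chunk_stride_map
{
    int size_stride, size_chunk;
    chunk_stride_map(int s, int c) : size_stride(s), size_chunk(c) {}
    __host__ __device__ int operator()(int i) const
    {
        // i-th element of [0,1,2, 8,9,10, 16,17,18, ...] for stride 8, chunk 3
        return (i / size_chunk) * size_stride + (i % size_chunk);
    }
};

int main()
{
    const int size_stride = 8, size_chunk = 3, N = 3;
    const int count = (N + 1) * size_chunk;          // 12 selected elements, k = 0..N

    thrust::device_vector<int> data(32);
    thrust::sequence(data.begin(), data.end());      // data[i] = i, so the output shows the picked indices

    auto map   = thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                                 chunk_stride_map(size_stride, size_chunk));
    auto begin = thrust::make_permutation_iterator(data.begin(), map);

    thrust::device_vector<int> out(count);
    thrust::copy(begin, begin + count, out.begin());

    thrust::host_vector<int> h_out = out;            // device -> host transfer for printing
    for (int i = 0; i < count; ++i) std::cout << h_out[i] << " ";   // 0 1 2 8 9 10 16 17 18 24 25 26
    std::cout << std::endl;
    return 0;
}
```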

GPGPU: Consequence of having a common PC in a warp

Submitted by 南笙酒味 on 2019-12-11 05:49:21
Question: I read in a book that in a wavefront or warp, all threads share a common program counter. What is the consequence of this? Why does it matter? Answer 1: NVIDIA GPUs execute 32 threads at a time (warps) and AMD GPUs execute 64 threads at a time (wavefronts). Sharing the control logic, fetch, and data paths reduces area and increases perf/area and perf/watt. In order to take advantage of this design, programming languages and developers need to understand how to coalesce memory accesses and how to
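A tiny CUDA sketch of the main consequence: with one program counter per warp, the two sides of a divergent branch execute one after the other with the non-taken lanes masked off, rather than concurrently (illustrative kernel, not part of the quoted answer):

```
__global__ void divergent(int* out)
{
    int lane = threadIdx.x % 32;          // lane index within the warp
    if (lane < 16)
        out[threadIdx.x] = 1;             // pass 1: lanes 0-15 execute, 16-31 are masked off
    else
        out[threadIdx.x] = 2;             // pass 2: lanes 16-31 execute, 0-15 are masked off
    // the whole warp is only back in lockstep after the branch reconverges
}
```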

Is computing integral image on GPU really faster than on CPU?

Submitted by 眉间皱痕 on 2019-12-11 05:47:22
Question: I'm new to GPU computing, so this may be a really naive question. I did a few look-ups, and it seems computing an integral image on the GPU is a pretty good idea. However, when I really dig into it, I wonder whether it is actually faster than the CPU, especially for big images. So I just want to know your thoughts on it, and some explanation of whether the GPU is really faster. So, assuming we have an MxN image, computing the integral image on the CPU would need roughly 3xMxN additions, which is O(MxN). On the GPU, follow the code
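To make the operation count concrete: the single-pass CPU recurrence ii(r,c) = img(r,c) + ii(r-1,c) + ii(r,c-1) - ii(r-1,c-1) costs about three additions per pixel, which is where the 3xMxN figure comes from. A small C++ reference sketch (row-major layout assumed, names are placeholders):

```
#include <vector>

// Integral image of a row-major M x N image; ~3 additions per pixel.
std::vector<long long> integralImage(const std::vector<int>& img, int M, int N)
{
    std::vector<long long> ii(static_cast<size_t>(M) * N, 0);
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c)
        {
            long long up     = (r > 0)          ? ii[(r - 1) * N + c]       : 0;
            long long left   = (c > 0)          ? ii[r * N + (c - 1)]       : 0;
            long long upleft = (r > 0 && c > 0) ? ii[(r - 1) * N + (c - 1)] : 0;
            ii[r * N + c] = img[r * N + c] + up + left - upleft;
        }
    return ii;
}
```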

How is variable in device memory used by external function?

Submitted by 荒凉一梦 on 2019-12-11 05:38:45
Question: In this code:

#include <iostream>

void intfun(int * variable, int value){
  #pragma acc parallel present(variable[:1]) num_gangs(1) num_workers(1)
  {
    *variable = value;
  }
}

int main(){
  int var, value = 29;
  #pragma acc enter data create(var) copyin(value)
  intfun(&var,value);
  #pragma acc exit data copyout(var) delete(value)
  std::cout << var << std::endl;
}

How is int value recognized to be on device memory in intfun? If I replace present(variable[:1]) with present(variable[:1],value) in the intfun
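The behaviour being asked about hinges on value being a by-value scalar argument: in an acc parallel region, scalars not named in a data clause are treated as firstprivate by default, so their current value travels with the launch and needs no present() entry. A minimal sketch of that default (placeholder names, not the asker's program):

```
#include <iostream>

void scale(int* v, int factor)
{
    // 'v' must already be present on the device; 'factor' is a by-value scalar and is
    // implicitly firstprivate, i.e. its value is shipped with the kernel launch.
    #pragma acc parallel present(v[:1]) num_gangs(1) num_workers(1)
    {
        *v = *v * factor;
    }
}

int main()
{
    int x = 3;
    #pragma acc enter data copyin(x)
    scale(&x, 7);
    #pragma acc exit data copyout(x)
    std::cout << x << std::endl;   // 21 when the region runs on the device
    return 0;
}
```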