gpgpu

General-purpose compute with vertex/pixel shaders (OpenGL / DirectX)

随声附和 submitted on 2019-12-09 12:54:59
Question: I have a question regarding compute shaders. Are compute shaders available in DX9? Would it still be possible to use a compute shader with a DX9 driver if there is no compute shader hardware on the GPU? (The SGX 545 does not have it, but the SGX 6X generation is going to, as far as IMG says.) I would like to know if I can do some simple general-purpose programming on SGX GPUs with DirectX 9 or OpenGL drivers. Also, is there any way I can use OpenGL vertex shaders for GPGPU programming?

How is a CUDA kernel launched?

懵懂的女人 submitted on 2019-12-09 10:29:02
Question: I have created a simple CUDA application to add two matrices. It compiles fine. I want to know how the kernel will be launched by all the threads and what the flow will be inside CUDA; that is, in what fashion each thread will execute each element of the matrices. I know this is a very basic concept, but I don't know it, and I am confused about the flow. Answer 1: You launch a grid of blocks. Blocks are indivisibly assigned to multiprocessors (where the number of blocks on the multiprocessor
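
For readers new to the launch model, here is a minimal sketch of the matrix-add case described above; the kernel name, parameter list and 16x16 block size are illustrative choices, not the asker's actual code. Each thread maps its blockIdx/threadIdx coordinates onto one matrix element, so the grid as a whole covers the entire matrix while the hardware schedules the blocks onto multiprocessors in any order.

    // One thread per matrix element: the (row, col) position is derived from the
    // built-in block and thread indices.
    __global__ void matAdd(const float *A, const float *B, float *C,
                           int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < height && col < width) {          // guard against partial blocks
            int idx = row * width + col;
            C[idx] = A[idx] + B[idx];
        }
    }

    // Host-side launch: a 2D grid of 16x16 blocks that covers the whole matrix.
    // dim3 block(16, 16);
    // dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    // matAdd<<<grid, block>>>(d_A, d_B, d_C, width, height);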

How to debug OpenCL on Nvidia GPUs?

荒凉一梦 submitted on 2019-12-09 09:51:46
Question: Is there any way to debug OpenCL kernels on an Nvidia GPU, i.e. set breakpoints and inspect variables? My understanding is that Nvidia's tool does not allow OpenCL debugging, and AMD's and Intel's only allow it on their own devices. Answer 1: gDEBugger might help you somewhat (I have never used it though), but other than that there isn't any tool that I know of that can set breakpoints or inspect variables inside a kernel. Perhaps try to save intermediate outputs from your kernel if it is a long one.

CUDA - why is warp-based parallel reduction slower?

元气小坏坏 submitted on 2019-12-09 09:18:36
Question: I had the idea of a warp-based parallel reduction, since all threads of a warp are in sync by definition. So the idea was that the input data could be reduced by a factor of 64 (each thread reduces two elements) without any need for synchronization. As in the original implementation by Mark Harris, the reduction is applied at block level and the data is in shared memory: http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf I created a kernel to test his version and my warp-based version.
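
For context, here is a minimal sketch of the warp-level stage such a reduction typically uses, in the spirit of the linked Harris slides rather than the asker's exact kernel. On pre-Volta GPUs the 32 threads of a warp execute in lockstep, so no __syncthreads() is needed inside the warp; the volatile qualifier keeps the compiler from caching the shared-memory values in registers.

    // Final warp-level stage of a block reduction over shared memory.
    // Called by the first warp only, e.g.: if (tid < 32) warpReduce(sdata, tid);
    // Assumes earlier block-level stages left 64 partial sums in sdata.
    __device__ void warpReduce(volatile float *sdata, int tid)
    {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }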

Untrusted GPGPU code (OpenCL, etc.) - is it safe? What are the risks?

我只是一个虾纸丫 submitted on 2019-12-09 04:48:35
Question: There are many approaches when it comes to running untrusted code on a typical CPU: sandboxes, fake roots, virtualization... What about untrusted code for GPGPU (OpenCL, CUDA, or already compiled code)? Assuming that the memory on the graphics card is cleared before running such third-party untrusted code, are there any security risks? What kind of risks? Any way to prevent them? Is sandboxing possible / available for GPGPU? Maybe binary instrumentation? Other techniques? P.S. I am more interested in

std::vector to array in CUDA

杀马特。学长 韩版系。学妹 submitted on 2019-12-09 01:39:39
Question: Is there a way to convert a 2D vector into an array so that it can be used in CUDA kernels? It is declared as: vector<vector<int>> information; I want to cudaMalloc it and copy it from host to device. What would be the best way to do it? int *d_information; cudaMalloc((void**)&d_information, sizeof(int)*size); cudaMemcpy(d_information, information, sizeof(int)*size, cudaMemcpyHostToDevice); Answer 1: In a word, no, there isn't. The CUDA API doesn't support deep copying and also doesn't know anything about
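
The usual workaround, sketched below under the assumption that every inner vector has the same length, is to flatten the nested vector into one contiguous host buffer and copy that; the device code then indexes element (r, c) as r * cols + c.

    #include <vector>
    #include <cuda_runtime.h>

    // 'information' is the vector<vector<int>> from the question.
    size_t rows = information.size();
    size_t cols = rows ? information[0].size() : 0;   // assumes rectangular data

    std::vector<int> flat;
    flat.reserve(rows * cols);
    for (const auto &row : information)
        flat.insert(flat.end(), row.begin(), row.end());

    int *d_information;
    cudaMalloc((void **)&d_information, sizeof(int) * flat.size());
    cudaMemcpy(d_information, flat.data(), sizeof(int) * flat.size(),
               cudaMemcpyHostToDevice);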

Fermi L2 cache hit latency?

独自空忆成欢 submitted on 2019-12-09 00:57:39
Question: Does anyone know any details about the L2 cache in Fermi? I have heard that it is as slow as global memory and that L2 is only there to enlarge the memory bandwidth, but I can't find any official source to confirm this. Has anyone measured the hit latency of L2? What about size, line size, and other parameters? In effect, how do L2 read misses affect performance? My sense is that L2 only matters in very memory-bound applications. Please feel free to give your opinions. Thanks. Answer 1:
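
For anyone who wants to measure this themselves, below is a minimal sketch of the pointer-chasing micro-benchmark usually used for cache latencies; the kernel and buffer names are illustrative. The idea is that the host fills 'chain' with a pointer-chase permutation sized to fit in L2 but larger than L1 (or the code is compiled with -Xptxas -dlcm=cg so that global loads bypass L1 on Fermi), launches a single thread, and divides the reported cycle count by the number of iterations.

    // Launched as l2_latency_probe<<<1, 1>>>(d_chain, d_ticks, iters);
    __global__ void l2_latency_probe(const unsigned int *chain,
                                     unsigned int *ticks, int iters)
    {
        unsigned int j = 0;

        for (int i = 0; i < iters; ++i)   // warm-up pass: pull the chain into cache
            j = chain[j];

        unsigned int start = clock();
        for (int i = 0; i < iters; ++i)   // timed pass: each load depends on the last
            j = chain[j];
        unsigned int stop = clock();

        ticks[0] = stop - start;          // total cycles; divide by iters on the host
        ticks[1] = j;                     // keep j live so the loads are not removed
    }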

GPU Programming, CUDA or OpenCL? [closed]

北战南征 submitted on 2019-12-08 22:38:51
Question: Closed. This question needs to be more focused and is not currently accepting answers (closed 4 years ago). I am a newbie to GPU programming. I have a laptop with an NVIDIA GeForce GT 640 card. I am faced with two dilemmas; suggestions are most welcome. If I go for CUDA: Ubuntu or Windows? Clearly CUDA is more suitable for Windows, while it can be a severe issue to install on Ubuntu. I have

Fast rasterizing of text and vector art

一世执手 submitted on 2019-12-08 19:26:46
Question: Suppose there are a lot of vector shapes (Bezier curves that determine the boundary of a shape), for example a page full of tiny letters. What is the fastest way to create a bitmap out of them? I once saw a demo several years ago (I can't find it now) where some guys used the GPU to rasterize the vector art; they were able to zoom in and out of the page in real time. What is the current state of GPU rendering of Bezier shapes? Is it really fast? Faster than the CPU? What are the common and not-so-common

CUDA-parallelized raytracer: very low speedup

与世无争的帅哥 submitted on 2019-12-08 17:03:32
I'm coding a raytracer using (py)CUDA and I'm obtaining a really low speedup; for example, for a 1000x1000 image, the GPU-parallelized code is just 4 times faster than the sequential code executed on the CPU. For each ray I have to solve 5 equations (the raytracer generates images of black holes using the process described in this paper), so my setup is the following: each ray is computed in a separate block, where 5 threads compute the equations using shared memory. That is, if I want to generate an image with a width of W pixels and a height of H pixels, the setup is: Grid: W blocks x H
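
For reference, this is what that configuration looks like as a CUDA C launch; the asker uses pyCUDA, and 'trace', 'd_image' and the parameter list here are hypothetical placeholders for the actual kernel.

    dim3 grid(W, H);       // W x H blocks, one block per pixel/ray
    dim3 block(5, 1, 1);   // 5 threads per block, one per equation,
                           // cooperating through shared memory
    trace<<<grid, block>>>(d_image, W, H);

In pyCUDA the same launch would be expressed with the keyword arguments block=(5, 1, 1) and grid=(W, H).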