gpgpu | 易学教程

L2 cache in NVIDIA Fermi

Question: When looking at the names of the performance counters in the NVIDIA Fermi architecture (the file Compute_profiler.txt in the doc folder of CUDA), I noticed that for L2 cache misses there are two performance counters, l2_subp0_read_sector_misses and l2_subp1_read_sector_misses. These are said to be for two slices of L2. Why are there two slices of L2? Is there any relation to the streaming multiprocessor architecture? What effect does this division have on performance? Thanks

How to quickly compact a sparse array with CUDA C?

Question: Summary: Array [A - B - - - C] is in device memory but I want [A B C]. What's the quickest way with CUDA C? Context: I have an array A of integers in device (GPU) memory. At each iteration, I randomly choose a few elements that are larger than 0 and subtract 1 from them. I maintain a sorted lookup array L of those elements that are equal to 0. Array A: @ iteration i: [0 1 0 3 3 2 0 1 2 3] @ iteration i + 1: [0 0 0 3 2 2 0 1 2 3] Lookup for 0-elements L: @ iteration i: [0 - 2 - - - 6 - - -] -> want
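The operation asked about here is usually called stream compaction. A common way to do it is an exclusive prefix sum over "keep" flags followed by a scatter, and on the GPU a library routine such as thrust::copy_if typically covers it. The sketch below is only a sequential C++ illustration of that data flow, not a CUDA implementation; the helper name compact_by_scan and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: keeps in[i] wherever keep[i] != 0, preserving order.
// Step 1: an exclusive prefix sum over the keep flags gives every kept
//         element its position in the compacted output.
// Step 2: each kept element is scattered to that position.
// On a GPU both steps would run in parallel; here they are plain loops.
std::vector<int> compact_by_scan(const std::vector<int>& in,
                                 const std::vector<int>& keep) {
    assert(in.size() == keep.size());
    const std::size_t n = in.size();

    std::vector<std::size_t> offset(n);
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offset[i] = running;                 // exclusive scan value
        running += (keep[i] != 0) ? 1 : 0;
    }

    std::vector<int> out(running);           // running == number of kept items
    for (std::size_t i = 0; i < n; ++i) {
        if (keep[i] != 0) {
            out[offset[i]] = in[i];          // scatter
        }
    }
    return out;
}
```

With the indices 0..9 as input and a flag set wherever array A from iteration i is 0, this yields the indices 0, 2 and 6, which matches the non-empty slots of the lookup array L shown in the question.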

Poor OpenGL image processing performance

I'm trying to do some simple image processing using OpenGL. Since I couldn't find any good library that does this already, I've been trying to build my own solution. I simply want to compose a few images on the GPU and then read them back. However, the performance of my implementation seems almost equal to what it takes to do on the CPU, so something is wrong. I've tried to follow the best practices I've found on the net, but it's still doing something wrong. I've tried removing all the irrelevant code. Any ideas as to why this implementation has poor performance? int image_width = 1280; int image

OpenGL Compute Shader Invocations

Question: I have a question related to the new compute shaders. I am currently working on a particle system. I store all my particles in a shader storage buffer to access them in the compute shader. Then I dispatch a one-dimensional work group. #define WORK_GROUP_SIZE 128 _shaderManager->useProgram("computeProg"); glDispatchCompute((_numParticles/WORK_GROUP_SIZE), 1, 1); glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT); My compute shader: #version 430 struct particle{ vec4 currentPos; vec4 oldPos; };
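One detail of the dispatch call above that may be worth checking: _numParticles/WORK_GROUP_SIZE is an integer division, so it rounds down, and any particles past the last full group of 128 get no invocation. The question text is cut off, so this may or may not be the issue being asked about. A minimal sketch of the usual round-up calculation follows; the helper name group_count is hypothetical, and it assumes the shader itself skips invocations whose index is beyond the particle count.

```cpp
#include <cstdint>

// Hypothetical helper: number of work groups needed so that
// group_size invocations per group cover all `count` items (rounds up).
// Assumes count + group_size - 1 does not overflow 32 bits.
constexpr std::uint32_t group_count(std::uint32_t count,
                                    std::uint32_t group_size) {
    return (count + group_size - 1) / group_size;
}

// With 1000 particles and groups of 128, plain division gives 7 groups
// (896 invocations), leaving 104 particles untouched; rounding up gives 8.
static_assert(1000u / 128u == 7u, "truncating division");
static_assert(group_count(1000u, 128u) == 8u, "rounded-up group count");
```

The result would be passed as the first argument of glDispatchCompute in place of the plain division.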

How can OpenGL ES be used for GPGPU implementation

Question: I want to use OpenGL ES for a GPGPU implementation of some image processing code. I want to know whether I can use OpenGL ES for this purpose. If I can, then which version of OpenGL ES would be more appropriate (OpenGL ES 1.1 or 2.0)? Answer 1: OpenGL ES is a graphics technology for embedded systems, and therefore not quite as powerful as its bigger brother. OpenGL ES was not designed with GPGPU processing in mind, but some algorithms, especially those that work on images and require per

CUDA: Does passing arguments to a kernel slow the kernel launch much?

Question: CUDA beginner here. In my code I am currently launching kernels many times in a loop in the host code (because I need synchronization between blocks). So I wondered whether I might be able to optimize the kernel launch. My kernel launches look something like this: MyKernel<<<blocks,threadsperblock>>>(double_ptr, double_ptr, int N, double x); So to launch a kernel, some signal obviously has to go from the CPU to the GPU, but I'm wondering if the passing of arguments makes this process noticeably

Multi GPU usage with CUDA Thrust

I want to use my two graphics cards for calculation with CUDA Thrust. Running on a single card works well for either card, even when I store two device_vectors in the std::vector. If I use both cards at the same time, the first cycle in the loop works and causes no error. After the first run it causes an error, probably because the device pointer is not valid. I am not sure what the exact problem is, or how to use both cards for calculation. Minimal code sample: std::vector<thrust::device_vector<float> > TEST() { std::vector<thrust::device_vector<float> > vRes; unsigned

Installed beignet to use OpenCL on Intel, but OpenCL programs only work when run as root

I have an Intel HD Graphics 4000 (3rd Gen processor), and my OS is Linux Mint 17.1 64-bit. I installed beignet to be able to use OpenCL and thus run programs on the GPU. I had been having lots of problems using the pyOpenCL bindings, so I decided to uninstall my current beignet version and install the latest one (you can see the previous question I asked and answered myself about it here). Upgrading beignet worked and I can now run OpenCL code on my GPU through the Python and C/C++ bindings. However, I can only run the programs as root, otherwise they don't detect my GPU as a valid device. The

How do you free up GPU memory?

When running Theano, I get an error: not enough memory. See below. What are some possible actions that can be taken to free up memory? I know I can close applications etc., but I just want to see if anyone has other ideas. For example, is it possible to reserve memory? THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python conv_exp.py Using gpu device 0: GeForce GT 650M Trying to run under a GPU. If this is not desired, then modify network3.py to set the GPU flag to False. Error allocating 156800000 bytes of device memory (out of memory). Driver report 64192512 bytes free and 1073414144 bytes

DirectCompute versus OpenCL for GPU programming?

I have some (financial) tasks which should map well to GPU computing, but I'm not really sure if I should go with OpenCL or DirectCompute. I did some GPU computing, but it was a long time ago (3 years). I did it through OpenGL since there was not really any alternative back then. I've seen some OpenCL presentations and it looks really nice. I haven't seen anything about DirectCompute yet, but I expect it to also be good. I'm not interested at the moment in cross-platform compatibility, and besides, I expect the two models to be similar enough to not cause a big headache when trying to go from