gpgpu

How do you free up GPU memory?

99封情书 submitted on 2019-12-07 16:21:17
Question: When running Theano, I get a "not enough memory" error (see below). What are some possible actions that can be taken to free up memory? I know I can close applications, etc., but I just want to see if anyone has other ideas. For example, is it possible to reserve memory?

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python conv_exp.py
Using gpu device 0: GeForce GT 650M
Trying to run under a GPU. If this is not desired, then modify network3.py to set the GPU flag to False.
Error allocating …
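One way to see how much device memory other applications are actually holding (and whether closing them helps) is to query the CUDA runtime directly. The sketch below is my own illustration, not part of the question; it only assumes a CUDA toolkit is installed alongside Theano's GPU backend.

#include <cstdio>
#include <cuda_runtime.h>

// Print free vs. total device memory, so you can see how much room
// other processes are currently leaving on the GPU.
int main() {
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("free: %zu MiB / total: %zu MiB\n",
                free_bytes >> 20, total_bytes >> 20);
    return 0;
}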

Can I use Quadro K4000 and K2000 for GPUDirect v2 Peer-to-Peer (P2P) communication?

喜你入骨 submitted on 2019-12-07 15:24:30
Question: I use:
Single CPU (Intel Core i7-4820K Ivy Bridge-E), 40 lanes of PCIe 3.0, motherboard MSI X79A-GD65 (8D)
Windows Server 2012, MSVS 2012 + CUDA 5.5, compiled as a 64-bit application
GPUs: nVidia Quadro K4000 and K2000, all Quadros in TCC mode (Tesla Compute Cluster)
nVidia video driver 332.50
The simpleP2P test showed that both Quadros (K4000 and K2000) are capable of Peer-to-Peer (P2P), but Peer-to-Peer (P2P) access Quadro K4000 (GPU0) <-> Quadro K2000 (GPU1): No. Can I use Quadro K4000 and …
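For reference, the usual runtime check for this is cudaDeviceCanAccessPeer followed by cudaDeviceEnablePeerAccess. The sketch below is my own illustration, assuming device 0 is the K4000 and device 1 is the K2000 as in the question:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    // Ask the driver whether each device can map the other's memory.
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    std::printf("GPU0 -> GPU1: %s, GPU1 -> GPU0: %s\n",
                can01 ? "yes" : "no", can10 ? "yes" : "no");

    if (can01 && can10) {
        // Enable P2P in both directions; afterwards cudaMemcpyPeer and
        // direct loads/stores across devices become possible.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }
    return 0;
}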

Multi-GPU usage with CUDA Thrust

。_饼干妹妹 submitted on 2019-12-07 15:20:44
Question: I want to use my two graphics cards for a calculation with CUDA Thrust. Running on a single card works well for both cards, even when I store the two device_vectors in a std::vector. If I use both cards at the same time, the first cycle of the loop works and causes no error. After that first run it causes an error, probably because the device pointer is not valid. I am not sure what the exact problem is, or how to use both cards for the calculation. Minimal code sample: std: …
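A common pattern for this (a sketch of my own, not the asker's missing code) is to make a device current with cudaSetDevice before constructing, using, and destroying each thrust::device_vector, so every vector's storage lives and is freed on the GPU that was current when it was created:

#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>

int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    // One vector per GPU; each must be created while its GPU is current.
    std::vector<thrust::device_vector<int>*> vecs;
    for (int d = 0; d < device_count; ++d) {
        cudaSetDevice(d);
        vecs.push_back(new thrust::device_vector<int>(1 << 20));
        thrust::sequence(vecs[d]->begin(), vecs[d]->end());
    }

    // Work on each GPU: switch devices again before touching its vector.
    for (int d = 0; d < device_count; ++d) {
        cudaSetDevice(d);
        long long sum = thrust::reduce(vecs[d]->begin(), vecs[d]->end(), 0LL);
        (void)sum;
    }

    // Destroy each vector with its own device current.
    for (int d = 0; d < device_count; ++d) {
        cudaSetDevice(d);
        delete vecs[d];
    }
    return 0;
}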

OpenCL creating wrong colours

不想你离开。 submitted on 2019-12-07 15:00:26
Question: I'm having an issue with an OpenCL image filter I've been trying to get working. I've written many of these before (Sobel edge detection, auto segmentation, and such), so I thought I knew how to do this, but the following code is giving me some really weird output:

// NoRedPixels.cl
__kernel void NoRedPixels(
    __read_only image2d_t srcImg,
    __write_only image2d_t dstImg,
    sampler_t sampler,
    int width,
    int height,
    int threshold,
    int colour,
    int fill)
{
    int2 imageCoordinate = (int2)(get_global_id(0 …

Poor OpenGL image processing performance

纵然是瞬间 submitted on 2019-12-07 14:26:45
Question: I'm trying to do some simple image processing using OpenGL. Since I couldn't find any good library that already does this, I've been trying to do my own solution. I simply want to compose a few images on the GPU and then read them back. However, the performance of my implementation seems almost equal to what it takes to do it on the CPU... something is wrong. I've tried to follow the best practices I've found on the net, but it is still doing something wrong. I've tried removing all the irrelevant …

DirectCompute versus OpenCL for GPU programming?

为君一笑 submitted on 2019-12-07 11:43:06
Question: I have some (financial) tasks which should map well to GPU computing, but I'm not really sure whether I should go with OpenCL or DirectCompute. I did some GPU computing, but it was a long time ago (three years). I did it through OpenGL, since there was not really any alternative back then. I've seen some OpenCL presentations and it looks really nice. I haven't seen anything about DirectCompute yet, but I expect it to also be good. I'm not interested at the moment in cross-platform compatibility, and …

Generalized Hough Transform in CUDA - How can I speed up the binning process?

末鹿安然 submitted on 2019-12-07 09:21:05
Question: Like the title says, I'm working on a little personal research project into parallel computer vision techniques. Using CUDA, I am trying to implement a GPGPU version of the Hough transform. The only problem I've encountered is during the voting process. I'm calling atomicAdd() to prevent multiple simultaneous write operations, and I don't seem to be gaining much performance. I've searched the web, but haven't found any way to noticeably enhance the performance of the voting …
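A widely used way to reduce global atomic contention in a voting/binning kernel is to accumulate votes in a per-block shared-memory histogram and flush it to the global accumulator once per block. The sketch below is my own illustration of that technique, not the asker's code; it assumes the bin index of each vote has already been computed into an array.

__global__ void vote_shared(const int* bin_of_vote, int num_votes,
                            unsigned int* global_hist, int num_bins)
{
    // Per-block sub-histogram in shared memory (num_bins must fit there).
    extern __shared__ unsigned int local_hist[];
    for (int i = threadIdx.x; i < num_bins; i += blockDim.x)
        local_hist[i] = 0;
    __syncthreads();

    // Shared-memory atomics: contention is only among this block's threads.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = tid; i < num_votes; i += gridDim.x * blockDim.x)
        atomicAdd(&local_hist[bin_of_vote[i]], 1u);
    __syncthreads();

    // One global atomicAdd per (block, bin) instead of one per vote.
    for (int i = threadIdx.x; i < num_bins; i += blockDim.x)
        if (local_hist[i] != 0)
            atomicAdd(&global_hist[i], local_hist[i]);
}

// Launch with the histogram size as dynamic shared memory, e.g.:
// vote_shared<<<blocks, threads, num_bins * sizeof(unsigned int)>>>(bins, n, hist, num_bins);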

Is there a CUDA smart pointer?

筅森魡賤 submitted on 2019-12-07 07:54:58
Question: If not, what is the standard way to free up cudaMalloc'ed memory when an exception is thrown? (Note that I am unable to use Thrust.)
Answer 1: You can use the RAII idiom and put your cudaMalloc() and cudaFree() calls in the constructor and destructor of your object, respectively. When an exception is thrown, your destructor will be called, which will free the allocated memory. If you wrap this object in a smart pointer (or make it behave like a pointer), you will get your CUDA smart pointer.
Source: https: …
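A minimal sketch of the RAII wrapper the answer describes; the class name and interface below are my own, since the answer gives no code.

#include <cstddef>
#include <stdexcept>
#include <cuda_runtime.h>

// RAII owner for device memory: allocate in the constructor, free in the
// destructor, so stack unwinding releases the buffer when an exception is thrown.
template <typename T>
class DeviceBuffer {
public:
    explicit DeviceBuffer(std::size_t count) : ptr_(nullptr), count_(count) {
        if (cudaMalloc(reinterpret_cast<void**>(&ptr_), count * sizeof(T)) != cudaSuccess)
            throw std::runtime_error("cudaMalloc failed");
    }
    ~DeviceBuffer() { cudaFree(ptr_); }

    // Non-copyable, so the device pointer has exactly one owner.
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    T* get() const { return ptr_; }
    std::size_t size() const { return count_; }

private:
    T* ptr_;
    std::size_t count_;
};

// Usage: DeviceBuffer<float> buf(1024); my_kernel<<<grid, block>>>(buf.get(), ...);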

Is it possible to span an OpenCL kernel to run concurrently on CPU and GPU?

北慕城南 submitted on 2019-12-07 07:43:21
Question: Let's assume that I have a computer which has a multicore processor and a GPU. I would like to write an OpenCL program which runs on all cores of the platform. Is this possible, or do I need to choose a single device on which to run the kernel?
Answer 1: In theory yes, you can; the CL API allows it. But the platform/implementation must support it, and I don't think most CL implementations do. To do it, get the cl_device_id of the CPU device and the GPU device, and create a context with those two …
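A sketch of the approach the answer outlines, with error handling trimmed; it assumes (as the answer warns is rare) that the CPU and GPU devices are exposed by the same cl_platform_id:

#include <CL/cl.h>

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    // Fetch one CPU device and one GPU device from the same platform.
    cl_device_id devices[2];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &devices[0], NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &devices[1], NULL);

    // One context spanning both devices, then one command queue per device;
    // the NDRange is split by enqueueing part of the work on each queue.
    cl_int err;
    cl_context ctx = clCreateContext(NULL, 2, devices, NULL, NULL, &err);
    cl_command_queue cpu_queue = clCreateCommandQueue(ctx, devices[0], 0, &err);
    cl_command_queue gpu_queue = clCreateCommandQueue(ctx, devices[1], 0, &err);

    /* ... build the program for both devices, create kernels, and enqueue
       a portion of the work on each queue ... */

    clReleaseCommandQueue(gpu_queue);
    clReleaseCommandQueue(cpu_queue);
    clReleaseContext(ctx);
    return 0;
}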

Which memory access pattern is more efficient for a cached GPU?

大城市里の小女人 submitted on 2019-12-07 06:05:34
Question: So let's say I have a global array of memory:

|a|b|c| |e|f|g| |i|j|k| |

There are four 'threads' (local work items in OpenCL) accessing this memory, and two possible patterns for this access (columns are time slices, rows are threads):

     0 -> 1 -> 2 -> 3
t1   a -> b -> c -> .
t2   e -> f -> g -> .
t3   i -> j -> k -> .
t4   .    .    .   `> .

The above pattern splits the array into blocks, with each thread iterating to and accessing the next element in a block per time slice. I believe this sort of access would …
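Expressed as index arithmetic (a sketch in CUDA terms for concreteness; the excerpt is cut off before the second pattern is shown), the pattern above has each thread t walk its own contiguous block of length block_len, touching element t * block_len + s at time slice s:

__global__ void blocked_read(const float* data, float* out, int block_len)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // thread id (t1..t4 above)
    float acc = 0.0f;
    // At any given time slice s, neighbouring threads read addresses that are
    // block_len elements apart, i.e. each thread stays inside its own block.
    for (int s = 0; s < block_len; ++s)
        acc += data[t * block_len + s];
    out[t] = acc;
}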