gpu | 易学教程

Shared memory matrix multiplication kernel

Question: I am attempting to implement a shared-memory-based matrix multiplication kernel as outlined in the CUDA C Programming Guide. The following is the kernel: __global__ void matrixMultiplyShared(float * A, float * B, float * C, int ARows, int AColumns, int BRows, int BColumns, int CRows, int CColumns) { float * CSub = &C[CColumns * 16 * blockIdx.y + 16 * blockIdx.x]; float CValue = 0; for (int k = 0; k < (AColumns / 16); ++k) { float * ASub = &A[AColumns * 16 * blockIdx.y + 16 * k]; float * BSub
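The tiling idea behind the kernel can be checked on the host before debugging the GPU version. The following is a minimal pure-Python sketch, not the asker's code: the function name `tiled_matmul` and its `tile` parameter are hypothetical, and list slices stand in for the shared-memory copies of the sub-matrices. It steps over the inner dimension in tile-sized chunks and also visits a trailing partial tile, which a loop bounded by the integer division `AColumns / 16` would skip when `AColumns` is not a multiple of 16.

```python
TILE = 16  # tile width; 16 matches the constant in the question's kernel


def tiled_matmul(a, b, tile=TILE):
    """Multiply a (M x K) by b (K x N), given as non-empty lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")
    c = [[0.0] * n for _ in range(m)]
    # Each (row0, col0) pair plays the role of one thread block.
    for row0 in range(0, m, tile):
        for col0 in range(0, n, tile):
            # Stepping by `tile` also reaches the last, partial tile along K.
            for k0 in range(0, k, tile):
                # These slices stand in for the tiles staged in shared memory.
                a_sub = [row[k0:k0 + tile] for row in a[row0:row0 + tile]]
                b_sub = [row[col0:col0 + tile] for row in b[k0:k0 + tile]]
                for i, a_row in enumerate(a_sub):
                    for j in range(len(b_sub[0])):
                        c[row0 + i][col0 + j] += sum(
                            a_row[p] * b_sub[p][j] for p in range(len(b_sub))
                        )
    return c
```

Comparing its output with a naive triple loop on matrices whose sizes are not multiples of the tile width is a quick way to see whether edge tiles are handled.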

How to install plaidML / plaidML-keras

Question: I am trying to install plaidML-keras so I can run TensorFlow workloads on my MacBook Pro's GPU (Radeon Pro 560X). From my research, it can be done using plaidML-Keras (installation instructions). When I run pip install -U plaidml-keras it works fine, but the next step, plaidml-setup, returns the following error. Traceback (most recent call last): File "/usr/local/bin/plaidml-setup", line 6, in <module> from plaidml.plaidml_setup import main File "/usr/local/lib/python3.7/site-packages/plaidml/_

OpenGL GPU Memory cleanup, required?

Question: Do I have to clean up all display lists, textures, (geometry) shaders and so on by hand via the glDelete* functions, or does the GPU memory get freed automatically when my program exits or crashes? Note: GPU memory refers to dedicated memory on a dedicated graphics card, not CPU memory. Answer 1: Free the context; everything else is local to the context (unless you enabled display list sharing) and will go away along with it. Answer 2: As others mentioned, your OS (in collaboration with the driver resource

CUDA Primes Generation

Question: My CUDA program stops working (it prints nothing) as the data size increases over 260k. Can someone tell me why this is happening? This is my first CUDA program. And if I want bigger primes, how can I use a data type larger than long long int in CUDA? The graphics card is a GT 425M. #include<stdio.h> #include<stdlib.h> #include<cuda.h> #define SIZE 250000 #define BLOCK_NUM 96 #define THREAD_NUM 1024 int data[SIZE]; __global__ static void sieve(int *num,clock_t* time){ const int tid = threadIdx.x; const int
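When a GPU sieve prints nothing, a host-side reference helps separate "the kernel never ran" from "the kernel produced wrong results". The following is a minimal sketch of a standard sieve of Eratosthenes in Python, not the asker's algorithm (the kernel body is not visible in the excerpt); the function name `primes_up_to` is hypothetical.

```python
def primes_up_to(limit):
    """Return all primes <= limit using a sieve of Eratosthenes."""
    if limit < 2:
        return []
    # One byte per candidate: 1 means "still possibly prime".
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0] = is_prime[1] = 0
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            # Clear every multiple of p, starting at p*p.
            count = len(range(p * p, limit + 1, p))
            is_prime[p * p::p] = bytearray(count)
        p += 1
    return [i for i, flag in enumerate(is_prime) if flag]
```

Running it with the same `SIZE` as the GPU program gives a list to compare against whatever the kernel copies back. Python integers are arbitrary precision, so the reference is not limited by the width of long long int.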

How to debug OpenCL on Nvidia GPUs?

Question: Is there any way to debug OpenCL kernels on an Nvidia GPU, i.e. set breakpoints and inspect variables? My understanding is that Nvidia's tool does not allow OpenCL debugging, and AMD's and Intel's only allow it on their own devices. Answer 1: gDEBugger might help you somewhat (I have never used it, though), but other than that I do not know of any tool that can set breakpoints or inspect variables inside a kernel. Perhaps try saving intermediate outputs from your kernel if it is a long kernel.

Resize 3D data in tensorflow like tf.image.resize_images

Question: I need to resize some 3D data, like the tf.image.resize_images method does for 2D data. I was thinking I could run tf.image.resize_images on it in a loop and swap axes, but I thought there must be an easier way. Simple nearest neighbour should be fine. Any ideas? It's not ideal, but I could settle for the case where the data is just 0 or 1 and use something like: tf.where(boolMap, tf.fill(data_im*2, 0), tf.fill(data_im*2, 1)) But I'm not sure how to get boolMap. Would use of tf.while
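Nearest-neighbour resizing of a volume reduces to picking, for every output index along each axis, one source index. The following is a minimal pure-Python sketch on nested lists, not a TensorFlow API: the names `nearest_indices` and `resize_nearest_3d` are hypothetical, and the floor-based index mapping is an assumption that may differ from the rounding convention tf.image.resize_images uses.

```python
def nearest_indices(in_size, out_size):
    """Map each output index to a source index by flooring i * in / out."""
    return [i * in_size // out_size for i in range(out_size)]


def resize_nearest_3d(volume, out_shape):
    """Resize a non-empty depth x height x width nested list to out_shape."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    z_idx = nearest_indices(depth, out_shape[0])
    y_idx = nearest_indices(height, out_shape[1])
    x_idx = nearest_indices(width, out_shape[2])
    # Gather along each axis in turn; no interpolation is performed.
    return [[[volume[z][y][x] for x in x_idx] for y in y_idx] for z in z_idx]
```

The same three index lists could in principle be applied to a tensor with one gather per axis (for example tf.gather with its axis argument), which avoids both the loop over slices and the boolMap construction.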

OpenCL integration with Android

Question: I have searched a lot on Google but I am unable to find good documentation about integrating OpenCL with Android. I referred to this link: https://aplacetogeek.wordpress.com/android-with-opencl-tutorial/ But this seems incomplete. Is anyone aware of how to go about using OpenCL in Android? Working example code, if any, is also appreciated. I want to learn about it. Answer 1: Similar questions have been asked before; I suggest you read the following pages first: How to use OpenCL

Android GPU profiling - OpenGL Live Wallpaper is slow

Question: I'm developing a Live Wallpaper using OpenGL ES 3.0. I've set it up according to the excellent tutorial at http://www.learnopengles.com/how-to-use-opengl-es-2-in-an-android-live-wallpaper/, adapting GLSurfaceView and using it inside the Live Wallpaper. I have a decent knowledge of OpenGL/GLSL best practices, and I've set up a simple rendering pipeline where the draw loop is as tight as possible. No re-allocations, using one static VBO for non-changing data, a dynamic VBO for updates, using only

Fermi L2 cache hit latency?

Question: Does anyone have information about the L2 cache in Fermi? I have heard that it is as slow as global memory, and that the purpose of L2 is just to increase memory bandwidth. But I can't find any official source to confirm this. Has anyone measured the hit latency of L2? What about size, line size, and other parameters? In effect, how do L2 read misses affect performance? In my view, L2 only matters in very memory-bound applications. Please feel free to give your opinions. Thanks. Answer 1:

GPU Programming, CUDA or OpenCL? [closed]

Question: I am a newbie to GPU programming. I have a laptop with an NVIDIA GeForce GT 640 card. I am faced with two dilemmas; suggestions are most welcome. If I go for CUDA: Ubuntu or Windows? Clearly CUDA is more suitable to Windows, while it can be a severe issue to install on Ubuntu. I have