gpu

Shared memory matrix multiplication kernel

有些话、适合烂在心里 submitted on 2019-12-09 22:38:21
Question: I am attempting to implement a shared-memory-based matrix multiplication kernel as outlined in the CUDA C Programming Guide. The following is the kernel: __global__ void matrixMultiplyShared(float * A, float * B, float * C, int ARows, int AColumns, int BRows, int BColumns, int CRows, int CColumns) { float * CSub = &C[CColumns * 16 * blockIdx.y + 16 * blockIdx.x]; float CValue = 0; for (int k = 0; k < (AColumns / 16); ++k) { float * ASub = &A[AColumns * 16 * blockIdx.y + 16 * k]; float * BSub
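For context, this is the tiled multiply from the Programming Guide. A minimal sketch of that pattern, assuming square 16x16 tiles and matrix dimensions that are exact multiples of the tile size (the kernel name and the TILE constant are illustrative, not taken from the post):

#define TILE 16

// Sketch: C = A * B, where A is ARows x AColumns and B is AColumns x BColumns,
// with every dimension assumed to be an exact multiple of TILE.
__global__ void matMulTiledSketch(const float *A, const float *B, float *C,
                                  int ARows, int AColumns, int BColumns)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;    // row of C computed by this thread
    int col = blockIdx.x * TILE + threadIdx.x;    // column of C computed by this thread
    float CValue = 0.0f;

    for (int k = 0; k < AColumns / TILE; ++k) {
        // Each thread loads one element of the current A tile and B tile.
        As[threadIdx.y][threadIdx.x] = A[row * AColumns + k * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(k * TILE + threadIdx.y) * BColumns + col];
        __syncthreads();                          // both tiles must be loaded before use

        for (int n = 0; n < TILE; ++n)
            CValue += As[threadIdx.y][n] * Bs[n][threadIdx.x];
        __syncthreads();                          // finish with the tiles before reloading
    }

    C[row * BColumns + col] = CValue;
}

If the dimensions are not exact multiples of TILE, the two loads and the final store also need bounds checks, which is presumably why the question's kernel carries the full set of Rows/Columns parameters.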

How to install plaidML / plaidML-keras

狂风中的少年 submitted on 2019-12-09 19:39:38
Question: So I am trying to install plaidML-keras so I can do TensorFlow work on my MacBook Pro's GPU (Radeon Pro 560X). From my research, it can be done using plaidML-Keras (installation instructions). When I run pip install -U plaidml-keras it works fine, but the next step, plaidml-setup, returns the following error. Traceback (most recent call last): File "/usr/local/bin/plaidml-setup", line 6, in <module> from plaidml.plaidml_setup import main File "/usr/local/lib/python3.7/site-packages/plaidml/_

OpenGL GPU Memory cleanup, required?

China☆狼群 submitted on 2019-12-09 17:01:05
Question: Do I have to clean up all display lists, textures, (geometry) shaders and so on by hand via the glDelete* functions, or does the GPU memory get freed automagically when my program exits/crashes? Note: GPU memory refers to dedicated memory on a dedicated graphics card, not CPU memory. Answer 1: Free the context; everything else is local to the context (unless you enabled display list sharing) and will go away along with it. Answer 2: As others mentioned, your OS (in collaboration with the driver resource
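To make the answers concrete, the explicit-cleanup path (if you choose not to rely on context destruction) typically looks like the following; the handle names are made up for the example:

// Explicit cleanup before the context goes away; all object names here are illustrative.
// If you simply destroy the context (or the process exits), the driver reclaims these
// resources anyway, which is the point made in Answer 1.
glDeleteTextures(1, &textureId);
glDeleteBuffers(1, &vertexBufferId);
glDeleteProgram(shaderProgramId);
glDeleteLists(displayListBase, displayListCount);
// Finally destroy the GL context through whatever windowing API created it
// (for example wglDeleteContext on Windows or glXDestroyContext on X11).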

CUDA Primes Generation

坚强是说给别人听的谎言 submitted on 2019-12-09 13:15:32
Question: My CUDA program stops working (it prints nothing) once the data size increases beyond 260k. Can someone tell me why this is happening? This is my first CUDA program. And if I want bigger primes, how do I use a datatype larger than long long int in CUDA? The graphics card is a GT 425M. #include<stdio.h> #include<stdlib.h> #include<cuda.h> #define SIZE 250000 #define BLOCK_NUM 96 #define THREAD_NUM 1024 int data[SIZE]; __global__ static void sieve(int *num, clock_t *time){ const int tid = threadIdx.x; const int
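The excerpt cuts off before the kernel body, so as a hedged illustration of the general approach such programs take, here is a minimal per-element trial-division marking kernel; it is not the poster's code and the names are made up. On the datatype question, CUDA's widest native integer type is the 64-bit long long / unsigned long long, so primes beyond that range generally require a multi-precision representation.

// Grid-stride sketch: mark composites by trial division. Assumes num[] was filled
// on the host with num[i] = i; after the kernel, num[i] == 0 means i is composite.
__global__ void markCompositesSketch(int *num, int limit)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x + 2; i < limit; i += stride) {
        for (int f = 2; f * f <= i; ++f) {
            if (i % f == 0) {        // found a factor, so i is not prime
                num[i] = 0;
                break;
            }
        }
    }
}

The grid-stride loop keeps the kernel covering every element however SIZE relates to BLOCK_NUM * THREAD_NUM, which sidesteps one common failure mode where a fixed one-thread-per-element kernel silently stops covering all of the data as it grows.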

How to debug OpenCL on Nvidia GPUs?

荒凉一梦 submitted on 2019-12-09 09:51:46
Question: Is there any way to debug OpenCL kernels on an Nvidia GPU, i.e. set breakpoints and inspect variables? My understanding is that Nvidia's tool does not allow OpenCL debugging, and AMD's and Intel's only allow it on their own devices. Answer 1: gDEBugger might help you somewhat (I have never used it myself), but other than that there is no tool I know of that can set breakpoints or inspect variables inside a kernel. Perhaps try to save intermediate outputs from your kernel if it is a long kernel.
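The "save intermediate outputs" suggestion in Answer 1 usually means passing an extra buffer into the kernel purely for inspection. A minimal sketch, in which the kernel, its arguments, and the extra debug buffer are all hypothetical:

// Illustrative only: write the values you would otherwise inspect in a debugger
// into a dedicated buffer, then read it back on the host with clEnqueueReadBuffer.
__kernel void my_kernel(__global const float *in,
                        __global float *out,
                        __global float *debug)     // extra buffer used only for inspection
{
    size_t gid = get_global_id(0);
    float intermediate = in[gid] * 2.0f;           // the value you want to look at
    debug[gid] = intermediate;                     // stands in for a breakpoint
    out[gid] = intermediate + 1.0f;
}

Where the implementation supports OpenCL 1.2, the built-in printf inside the kernel is another low-effort option.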

Resize 3D data in tensorflow like tf.image.resize_images

 ̄綄美尐妖づ submitted on 2019-12-09 06:58:25
Question: I need to resize some 3D data, like the tf.image.resize_images method does for 2D data. I was thinking I could try running tf.image.resize_images on it in a loop and swapping axes, but I thought there must be an easier way. Simple nearest neighbour should be fine. Any ideas? It's not ideal, but I could settle for the case where the data is just 0 or 1 and use something like: tf.where(boolMap, tf.fill(data_im*2, 0), tf.fill(data_im*2, 1)) But I'm not sure how to get boolMap. Would use of tf.while

OpenCL integration with Android

白昼怎懂夜的黑 submitted on 2019-12-09 06:54:59
Question: I have searched a lot on Google but I am unable to find good documentation about integrating OpenCL with Android. I referred to this link: https://aplacetogeek.wordpress.com/android-with-opencl-tutorial/ but it seems incomplete. Is anyone aware of how to go about doing things with OpenCL in Android? Working example code, if any, would also be appreciated. I want to learn about it. Answer 1: Similar questions have been asked before; I suggest you read the following pages first: How to use OpenCL

Android GPU profiling - OpenGL Live Wallpaper is slow

人走茶凉 submitted on 2019-12-09 03:41:05
Question: I'm developing a Live Wallpaper using OpenGL ES 3.0. I've set it up according to the excellent tutorial at http://www.learnopengles.com/how-to-use-opengl-es-2-in-an-android-live-wallpaper/, adapting GLSurfaceView and using it inside the Live Wallpaper. I have a decent knowledge of OpenGL/GLSL best practices, and I've set up a simple rendering pipeline where the draw loop is as tight as possible: no re-allocations, one static VBO for non-changing data, a dynamic VBO for updates, using only

Fermi L2 cache hit latency?

独自空忆成欢 submitted on 2019-12-09 00:57:39
Question: Does anyone know any details about the L2 cache in Fermi? I have heard that it is as slow as global memory and that L2 mainly serves to enlarge the effective memory bandwidth, but I can't find any official source to confirm this. Has anyone measured the hit latency of L2? What about its size, line size, and other parameters? In practice, how do L2 read misses affect performance? My sense is that L2 only matters for very memory-bound applications. Please feel free to give your opinions. Thanks. Answer 1:

GPU Programming, CUDA or OpenCL? [closed]

北战南征 submitted on 2019-12-08 22:38:51
Question: Closed. This question needs to be more focused. It is not currently accepting answers. Closed 4 years ago. I am a newbie to GPU programming. I have a laptop with an NVIDIA GeForce GT 640 card. I am faced with two dilemmas; suggestions are most welcome. If I go for CUDA -- Ubuntu or Windows? CUDA clearly seems easier to set up on Windows, while installing it on Ubuntu can be a serious problem. I have