gpu

OpenCL and GPU global synchronization

自古美人都是妖i submitted on 2019-12-23 17:45:06
Question: Has anyone tried the gpu_sync functions described in the article "Inter-Block GPU Communication via Fast Barrier Synchronization"? All the code described seems pretty simple and easy to implement, but it keeps freezing up my GPU. I'm sure I'm doing something stupid, but I can't see what. Can anyone help me? The strategy I'm using is the one described in the section “GPU Lock-Free Synchronization”, and here is the OpenCL source code I've implemented: static void globalSync(uint iGoalValue,
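For context, the paper's technique looks roughly like the sketch below. This is a minimal CUDA version of the lock-free barrier it describes (the paper targets CUDA; the poster's code is OpenCL), with illustrative array names rather than the poster's actual arguments. One classic failure mode that produces exactly this kind of freeze is launching more blocks than can be resident on the GPU at once, in which case the spin loops never terminate.

    // Minimal CUDA sketch of the paper's lock-free inter-block barrier.
    // Array names (g_in, g_out) are illustrative; both must be zero-initialized
    // device arrays of length gridDim.x, and every block of the grid must be
    // resident on the GPU simultaneously or the spin loops below never exit.
    __device__ void gpu_sync(int goalVal, volatile int *g_in, volatile int *g_out)
    {
        int tid = threadIdx.x;
        int bid = blockIdx.x;
        int nBlocks = gridDim.x;

        // Each block's thread 0 announces arrival (a real implementation would
        // issue __threadfence() first so prior writes are visible to other blocks).
        if (tid == 0)
            g_in[bid] = goalVal;

        // Block 0 waits for every block to check in, then releases them all.
        if (bid == 0) {
            for (int i = tid; i < nBlocks; i += blockDim.x)
                while (g_in[i] != goalVal) { /* spin */ }
            __syncthreads();
            for (int i = tid; i < nBlocks; i += blockDim.x)
                g_out[i] = goalVal;
        }

        // Every other block spins until block 0 releases it.
        if (tid == 0)
            while (g_out[bid] != goalVal) { /* spin */ }
        __syncthreads();
    }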

Number of total threads, blocks, and grids on my GPU.

点点圈 submitted on 2019-12-23 16:34:32
Question: For the NVIDIA GeForce 940MX GPU, deviceQuery shows it has 3 multiprocessors with 128 cores per MP, and a maximum of 2048 threads per multiprocessor. So 3 * 2048 = 6144, i.e. 6144 threads in total on the GPU. 6144 / 1024 = 6, i.e. 6 blocks in total. And the warp size is 32. But from this video https://www.youtube.com/watch?v=kzXjRFL-gjo I found that each GPU has a limit on threads but no limit on the number of blocks, which confused me. I would like to know: how many total threads are in my GPU? Can we use all
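The numbers being juggled here can be read directly from the CUDA runtime, which also makes the distinction clearer: SMs × max-threads-per-SM (3 × 2048 = 6144 on this card) is how many threads can be resident at one instant, while the grid-size limit on how many blocks a single launch may contain is far larger. A small standalone sketch (not the poster's code) that prints the relevant properties for device 0:

    // Standalone sketch: print the occupancy-related limits discussed above.
    // Build with: nvcc -o limits limits.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("Multiprocessors (SMs):        %d\n", prop.multiProcessorCount);
        printf("Max threads per SM:           %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Max threads per block:        %d\n", prop.maxThreadsPerBlock);
        printf("Warp size:                    %d\n", prop.warpSize);
        printf("Max resident threads (SMs x threads/SM): %d\n",
               prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
        printf("Max grid dimension x (blocks per launch): %d\n", prop.maxGridSize[0]);
        return 0;
    }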

Distributed Tensorflow device placement in Google Cloud ML engine

瘦欲@ submitted on 2019-12-23 13:16:33
Question: I am running a large distributed TensorFlow model in Google Cloud ML Engine and want to use machines with GPUs. My graph consists of two main parts: the input/data-reader function and the computation part. I wish to place the variables in the PS task, the input part on the CPU, and the computation part on the GPU. The function tf.train.replica_device_setter automatically places variables in the PS server. This is what my code looks like: with tf.device(tf.train.replica_device_setter(cluster

What are the latencies of a GPU?

被刻印的时光 ゝ submitted on 2019-12-23 12:50:12
Question: I can find the latencies, in either nanoseconds or CPU cycles, between a CPU core and its cache, main memory, etc., but it seems hard to find similar information about modern GPUs. Does anyone know the latencies of GPUs, especially the latency between a modern NVIDIA GPU (GF110 or later) and its memory? Thanks. GPU memory does have a much larger bandwidth, but what about its latency? I have heard that GPU latencies are just as high as those of a CPU, so basically make the larger
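Published figures vary a lot by architecture, so the usual approach is to measure. A dependent-load ("pointer-chasing") microbenchmark makes every load wait for the previous one, so cycles per iteration approximate the load latency. Below is a minimal CUDA sketch of that idea; the buffer size, stride, and iteration count are arbitrary choices for illustration, and error checking is omitted.

    // Pointer-chasing sketch for estimating global-memory load latency.
    // A single thread follows a chain of dependent loads; since each load's
    // address comes from the previous load, cycles/iteration ~ load latency.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void chase(const unsigned int *next, int iters,
                          unsigned long long *cycles, unsigned int *sink)
    {
        unsigned int idx = 0;
        unsigned long long start = clock64();
        for (int i = 0; i < iters; ++i)
            idx = next[idx];                       // dependent load chain
        *cycles = clock64() - start;
        *sink = idx;                               // keep the chain alive past the optimizer
    }

    int main()
    {
        const int n = 1 << 22;                     // 4M entries (16 MiB of indices)
        const int stride = 1021;                   // coprime with n: the chain never revisits an entry
        const int iters = 10000;

        unsigned int *h = (unsigned int *)malloc(n * sizeof(unsigned int));
        for (int i = 0; i < n; ++i)
            h[i] = (unsigned int)((i + stride) % n);

        unsigned int *d_next, *d_sink;
        unsigned long long *d_cycles, cycles = 0;
        cudaMalloc(&d_next, n * sizeof(unsigned int));
        cudaMalloc(&d_sink, sizeof(unsigned int));
        cudaMalloc(&d_cycles, sizeof(unsigned long long));
        cudaMemcpy(d_next, h, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

        chase<<<1, 1>>>(d_next, iters, d_cycles, d_sink);   // one thread: latency, not bandwidth
        cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);

        printf("~%.1f cycles per dependent global load\n", (double)cycles / iters);
        free(h);
        return 0;
    }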

Getting Theano to use the GPU

梦想的初衷 submitted on 2019-12-23 09:00:16
Question: I am having quite a bit of trouble setting up Theano to work with my graphics card - I hope you can give me a hand. I have used CUDA before and it is properly installed, as is necessary to run Nvidia Nsight. However, I now want to use it with PyDev and am having several problems following the 'Using the GPU' part of the tutorial at http://deeplearning.net/software/theano/install.html#gpu-linux The first is quite basic: how to set up the environment variables. It says I

How to run a GPU instance using Amazon EC2 Panel?

人盡茶涼 submitted on 2019-12-23 08:48:05
Question: I would like to run an Ubuntu GPU instance from the AWS EC2 control panel, but the combo box does not have the g2.2xlarge option to select. It looks like GPU instances are available only for the Amazon AMI; when I choose Ubuntu, it does not list GPU instance types. Is there any way to make it work? Answer 1: In order to use the g2.2xlarge instance type, you need to first select an AMI that is built with HVM (hardware-assisted virtualization). At the time of this writing, the official HVM AMIs for Ubuntu are not

Tensorflow only sees XLA_GPUs and cannot use them

一曲冷凌霜 submitted on 2019-12-23 04:56:32
Question: I have a machine with 8 GPUs (4x GTX 1080 Ti with 11 GB of RAM and 4x RTX 1080) and cannot get TensorFlow to use them correctly (or at all). When I do from tensorflow.python.client import device_lib print(device_lib.list_local_devices()) it prints [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 5295519098812813462 , name: "/device:XLA_GPU:0" device_type: "XLA_GPU" memory_limit: 17179869184 locality { } incarnation: 12186007115805339517 physical

multi-gpu cuda: Run kernel on one device and modify elements on the other?

会有一股神秘感。 submitted on 2019-12-23 04:28:26
Question: Suppose I have multiple GPUs in a machine and a kernel running on GPU0. With the UVA and P2P features of CUDA 4.0, can I modify the contents of an array on another device, say GPU1, while the kernel is running on GPU0? The simpleP2P example in the CUDA 4.0 SDK does not demonstrate this. It only demonstrates: peer-to-peer memcopies; a kernel running on GPU0 which reads input from a GPU1 buffer and writes output to a GPU0 buffer; a kernel running on GPU1 which reads input from a GPU0 buffer and
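For what it's worth, with UVA a pointer returned by cudaMalloc on one device can be dereferenced by a kernel running on another device once peer access is enabled, so a kernel on GPU0 can write straight into GPU1's buffer. A minimal sketch of that pattern (device numbers, sizes, and names are assumptions, and error checking is omitted):

    // Sketch: a kernel launched on GPU 0 writes into a buffer allocated on GPU 1,
    // relying on UVA plus peer access (CUDA 4.0+, P2P-capable devices required).
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scribble(int *peer_buf, int n, int value)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            peer_buf[i] = value;                   // store travels over the P2P link to GPU 1
    }

    int main()
    {
        const int n = 1 << 20;
        int ok = 0;
        cudaDeviceCanAccessPeer(&ok, 0, 1);
        if (!ok) { printf("P2P not supported between devices 0 and 1\n"); return 1; }

        int *buf_on_gpu1 = NULL;                   // allocated on GPU 1
        cudaSetDevice(1);
        cudaMalloc(&buf_on_gpu1, n * sizeof(int));

        cudaSetDevice(0);                          // switch to GPU 0
        cudaDeviceEnablePeerAccess(1, 0);          // let GPU 0 dereference GPU 1 pointers
        scribble<<<(n + 255) / 256, 256>>>(buf_on_gpu1, n, 42);
        cudaDeviceSynchronize();

        int check = 0;                             // verify from the host (UVA copy)
        cudaMemcpy(&check, buf_on_gpu1, sizeof(int), cudaMemcpyDefault);
        printf("first element of GPU 1 buffer: %d\n", check);
        return 0;
    }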

Can consecutive CUDA atomic operations on global memory benefit from L2 cache?

走远了吗. submitted on 2019-12-23 03:38:25
Question: On a cache-enabled CUDA device, does locality of reference in consecutive atomic operations on global memory addresses issued by one thread benefit from the L2 cache? For example, I have an atomic operation in a CUDA kernel that uses the returned value: uint a = atomicAnd( &(GM_addr[index]), b ); I'm wondering, if I am about to use another atomic operation from the same thread in the same kernel, whether I can confine the address of the new atomic operation to the 32-byte-long range [ &(GM_addr[index&0xFFFFFFF8]), &(GM_addr[index|7]) ]
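To make the pattern concrete, here is a kernel-only sketch of the access being asked about: a second atomic from the same thread confined to the same 32-byte-aligned window of GM_addr (eight consecutive uints). GM_addr, index, and b come from the question; the surrounding kernel, the second operand c, and the way the window element is chosen are illustrative assumptions, and the sketch does not by itself settle whether the L2 keeps that line resident between the two operations.

    // Kernel-only sketch: two atomics from one thread confined to a single
    // 32-byte-aligned window of GM_addr, i.e. elements [index & ~7, index | 7].
    // Assumes GM_addr holds at least n elements rounded up to a multiple of 8.
    __global__ void atomic_window(unsigned int *GM_addr, unsigned int b,
                                  unsigned int c, int n)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index >= n) return;

        // First atomic, as in the question; the returned value is used below.
        unsigned int a = atomicAnd(&GM_addr[index], b);

        // Second atomic from the same thread, kept inside the same 32-byte window
        // (8 * sizeof(uint) = 32 bytes), so both fall in one 32-byte global segment.
        unsigned int neighbor = ((unsigned int)index & ~7u) | (a & 7u);
        atomicOr(&GM_addr[neighbor], c);
    }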