nvidia

Testing GPU with tensorflow matrix multiplication

血红的双手。 Submitted on 2019-11-29 07:17:18
As many machine learning algorithms rely on matrix multiplication (or at least can be implemented using it), to test my GPU I plan to create matrices a and b, multiply them, and record the time the computation takes. Here is code that will generate two matrices of dimensions 300000x20000 and multiply them:

import tensorflow as tf
import numpy as np

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
#a = np.array([[1, 2, 3], [4, 5, 6]])
#b = np.array([1, 2, 3])
a = np.random.rand(300000, 20000)
b = np.random.rand(300000, 20000)
print("Init
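Note that two matrices of shape 300000x20000 cannot be multiplied as written, because the inner dimensions must match; one of them has to be transposed (or given a compatible shape). A minimal sketch of the timing idea, using NumPy and small hypothetical sizes so it runs anywhere (the same shape rule applies under TensorFlow):

```python
import time
import numpy as np

# Small illustrative sizes; scale these up to actually stress a GPU.
rows, cols = 3000, 200

a = np.random.rand(rows, cols)
b = np.random.rand(rows, cols)

# a @ b is invalid: inner dimensions (200 vs 3000) do not match.
# Multiplying by b's transpose gives (rows, cols) x (cols, rows).
start = time.time()
c = a @ b.T
elapsed = time.time() - start

print(c.shape)
print(f"matmul took {elapsed:.4f}s")
```

With TensorFlow the equivalent would be tf.matmul(a, b, transpose_b=True), and the session run itself would be what gets timed.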

CUDA - how much slower is transferring over PCI-E?

寵の児 Submitted on 2019-11-29 05:02:39
If I transfer a single byte from a CUDA kernel to the host over PCI-E (zero-copy memory), how much slower is it compared to transferring something like 200 megabytes? What I would like to know, since I know that transferring over PCI-E is slow for a CUDA kernel, is: does it change anything if I transfer just a single byte or a huge amount of data? Or, since memory transfers are performed in bulk, is transferring a single byte extremely expensive and wasteful compared to transferring 200 MB? I hope this plot explains everything. The data was generated by bandwidthTest in the CUDA samples. The
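The behaviour the question asks about can be sketched with a simple cost model: time ≈ fixed per-transfer latency + size / sustained bandwidth. The constants below are rough illustrative assumptions, not measured values for any particular GPU:

```python
# Illustrative PCIe cost model: every transfer pays a fixed latency,
# plus the payload size divided by sustained bandwidth.
LATENCY_S = 10e-6        # ~10 microseconds per transfer (assumed)
BANDWIDTH_BPS = 6e9      # ~6 GB/s sustained over PCIe (assumed)

def transfer_time(nbytes):
    """Estimated seconds to move nbytes across PCIe in one transfer."""
    return LATENCY_S + nbytes / BANDWIDTH_BPS

one_byte = transfer_time(1)
big = transfer_time(200 * 1024 * 1024)

# A single byte is dominated entirely by latency, so its effective
# bandwidth is tiny; a 200 MB transfer approaches the sustained peak.
print(f"1 byte:  {one_byte * 1e6:.1f} us")
print(f"200 MB:  {big * 1e3:.1f} ms, "
      f"{200 * 1024 * 1024 / big / 1e9:.2f} GB/s effective")
```

This is exactly the shape of the bandwidthTest curve: effective bandwidth climbs with transfer size until the latency term becomes negligible.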

How to run CUDA without a GPU using a software implementation?

北城余情 Submitted on 2019-11-29 04:10:44
My laptop doesn't have an nVidia graphics card, and I want to work on CUDA. The website says that CUDA can be used in emulation mode on non-CUDA hardware too. But when I tried installing the CUDA drivers downloaded from their website, I get the error "The nvidia setup couldn't locate any drivers that are compatible with your current hardware. Setup will now exit". Also, when I tried to run sample code from the SDK in Visual Studio 2008, I get an error that a .obj file is not found. Nils The easiest way to get started with GPU development is to get a cheap GPU (for example a GTX285) and a desktop

GPU shared memory size is very small - what can I do about it?

╄→尐↘猪︶ㄣ Submitted on 2019-11-29 03:08:22
The size of the shared memory ("local memory" in OpenCL terms) is only 16 KiB on most of today's nVIDIA GPUs. I have an application in which I need to create an array of 10,000 integers, so the amount of memory I need is 10,000 * 4 bytes = 40,000 bytes (about 39 KiB). How can I work around this? Is there any GPU that has more than 16 KiB of shared memory? Think of shared memory as an explicitly managed cache. You will need to store your array in global memory and cache parts of it in shared memory as needed, either by making multiple passes or some other scheme which minimises the number of
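The "multiple passes" idea can be sketched host-side with a NumPy stand-in for the CUDA pattern (the full array plays the role of global memory, a 16 KiB chunk plays the role of shared memory; names and the per-tile work are illustrative):

```python
import numpy as np

SHARED_BYTES = 16 * 1024                   # shared memory per block (16 KiB)
TILE = SHARED_BYTES // 4                   # 4096 int32 values fit in one tile

data = np.arange(10_000, dtype=np.int32)   # lives in "global memory"

total = 0
# Process the array tile by tile: each pass stages one chunk into the
# "shared memory" tile, works on it, then moves on to the next chunk.
for start in range(0, data.size, TILE):
    tile = data[start:start + TILE].copy() # global -> shared stage
    total += int(tile.sum())               # work on the cached tile

print(total)                               # same result as summing directly
```

In a real kernel the staging copy would be a cooperative load by the threads of a block, followed by __syncthreads() before the tile is used.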

OpenMP 4.0 in GCC: offload to nVidia GPU

落花浮王杯 Submitted on 2019-11-29 02:42:05
Question: TL;DR - Does GCC (trunk) already support OpenMP 4.0 offloading to nVidia GPUs? If so, what am I doing wrong? (Description below.) I'm running Ubuntu 14.04.2 LTS. I have checked out the most recent GCC trunk (dated 25 Mar 2015). I have installed the CUDA 7.0 toolkit according to the Getting Started on Ubuntu guide. CUDA samples run successfully, i.e. deviceQuery detects my GeForce GT 730. I have followed the instructions from https://gcc.gnu.org/wiki/Offloading as well as https://gcc.gnu.org

ERROR: clGetPlatformIDs -1001 when running OpenCL code (Linux)

百般思念 Submitted on 2019-11-29 02:28:26
After finally managing to get my code to compile with OpenCL, I cannot seem to get the output binary to run! This is on my Linux laptop running Kubuntu 13.10 x64. The error I get (printed from cl::Error) is: ERROR: clGetPlatformIDs -1001. I found this post, but there does not seem to be a clear solution. I added myself to the video group, but this does not seem to work. With regards to the ICD profile, I am not sure what I need to do - shouldn't this be included with the CUDA toolkit? If not, where could I download one? EDIT: It seems I have an ICD file on my system under /usr/share/nvidia-331

Why is my GPU slower than CPU when training LSTM/RNN models?

安稳与你 Submitted on 2019-11-29 02:03:10
Question: My machine has the following spec: CPU: Xeon E5-1620 v4; GPU: Titan X (Pascal); Ubuntu 16.04; Nvidia driver 375.26; CUDA toolkit 8.0; cuDNN 5.1. I've benchmarked the following Keras examples with TensorFlow as the backend:

SCRIPT NAME                  GPU      CPU
stated_lstm.py               5sec     5sec
babi_rnn.py                  10sec    12sec
imdb_bidirectional_lstm.py   240sec   116sec
imbd_lstm.py                 113sec   106sec

My GPU clearly outperforms my CPU in non-LSTM models:

SCRIPT NAME                  GPU      CPU
cifar10_cnn.py               12sec    123sec
imdb_cnn.py                  5sec     119sec

Matrix-vector multiplication in CUDA: benchmarking & performance

天大地大妈咪最大 Submitted on 2019-11-28 20:47:39
I'm updating my question with some new benchmarking results (I also reformulated the question to be more specific and updated the code). I implemented a kernel for matrix-vector multiplication in CUDA C, following the CUDA C Programming Guide, using shared memory. Let me first present some benchmarking results, which I obtained on a Jetson TK1 (GPU: Tegra K1, compute capability 3.2), and a comparison with cuBLAS: Here I guess cuBLAS does some magic, since it seems that its execution is not affected by the number of columns of A, which, in turn, implies that there is some sort of parallelisation
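The shared-memory scheme such a kernel uses can be sketched with a host-side NumPy analogue (tile width and names are illustrative): each pass stages a chunk of the vector x into the fast scratchpad, and every row accumulates its partial dot product from that chunk.

```python
import numpy as np

def matvec_tiled(A, x, tile=64):
    """Matrix-vector product computed in tiles of x, mimicking a CUDA
    kernel that stages chunks of x through shared memory."""
    m, n = A.shape
    y = np.zeros(m)
    # Each iteration corresponds to one shared-memory load plus the
    # partial dot products computed from that tile.
    for start in range(0, n, tile):
        end = min(start + tile, n)
        x_tile = x[start:end]             # the "shared memory" copy of x
        y += A[:, start:end] @ x_tile     # partial dot product per row
    return y

A = np.random.rand(128, 300)
x = np.random.rand(300)
assert np.allclose(matvec_tiled(A, x), A @ x)
```

In the CUDA version each block handles a slice of rows, the threads of a block cooperatively load x_tile, and a __syncthreads() separates the load from the accumulation.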

Running more than one CUDA applications on one GPU

与世无争的帅哥 Submitted on 2019-11-28 17:07:05
Question: The CUDA documentation does not specify how many CUDA processes can share one GPU. For example, if the same user launches more than one CUDA program with only one GPU card installed in the system, what is the effect? Will the correctness of execution be guaranteed? How does the GPU schedule tasks in this case? Answer 1: CUDA activity from independent host processes will normally create independent CUDA contexts, one for each process. Thus, the CUDA activity launched from separate host processes will take

How to perform Hadamard product with CUBLAS on complex numbers?

走远了吗. Submitted on 2019-11-28 14:43:00
I need to compute the element-wise multiplication of two vectors (Hadamard product) of complex numbers with NVidia CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently, you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to achieve it with CUBLAS for complex numbers? I cannot write my own kernel; I have to use CUBLAS (or another standard NVIDIA library if it is really not possible with CUBLAS). talonmies CUBLAS is based on the reference BLAS,
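The SBMV-style workaround rests on the identity diag(a) · b = a ∘ b: multiplying by a diagonal (or banded, with bandwidth zero) matrix built from one vector is exactly the Hadamard product. A NumPy sketch of the identity (in cuBLAS the same trick is available through its dgmm extension, e.g. cublasCdgmm for single-precision complex - verify the exact name against the cuBLAS documentation):

```python
import numpy as np

a = np.array([1 + 2j, 3 - 1j, 0.5 + 0j])
b = np.array([2 - 1j, 1 + 1j, 4 + 4j])

# Hadamard product expressed as a diagonal-matrix multiply:
# diag(a) @ b multiplies each b[i] by a[i], i.e. it equals a * b.
had = np.diag(a) @ b

assert np.allclose(had, a * b)
print(had)
```

On the GPU one would of course never materialise the full diagonal matrix; dgmm-style routines take the vector directly and apply it as an implicit diagonal.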