nvidia

Testing GPU with tensorflow matrix multiplication

血红的双手。 Submitted on 2019-11-29 07:17:18
As many machine learning algorithms rely on matrix multiplication (or at least can be implemented using it), to test my GPU I plan to create matrices a and b, multiply them, and record the time the computation takes. Here is code that will generate two matrices of dimensions 300000x20000 and multiply them:

import tensorflow as tf
import numpy as np

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
#a = np.array([[1, 2, 3], [4, 5, 6]])
#b = np.array([1, 2, 3])
a = np.random.rand(300000, 20000)
b = np.random.rand(300000, 20000)
print("Init
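Note that two matrices of shape 300000x20000 cannot be multiplied as written, because the inner dimensions must match; one of them has to be transposed (or given a compatible shape). A minimal sketch of the timing idea, using NumPy and small hypothetical sizes so it runs anywhere (the same shape rule applies under TensorFlow):

```python
import time
import numpy as np

# Small illustrative sizes; scale these up to actually stress a GPU.
rows, cols = 3000, 200

a = np.random.rand(rows, cols)
b = np.random.rand(rows, cols)

# a @ b is invalid: inner dimensions (200 vs 3000) do not match.
# Multiplying by b's transpose gives (rows, cols) x (cols, rows).
start = time.time()
c = a @ b.T
elapsed = time.time() - start

print(c.shape)
print(f"matmul took {elapsed:.4f}s")
```

With TensorFlow the equivalent would be tf.matmul(a, b, transpose_b=True), and the session run itself would be what gets timed.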

CUDA - how much slower is transferring over PCI-E?

寵の児 Submitted on 2019-11-29 05:02:39
If I transfer a single byte from a CUDA kernel to the host over PCI-E (zero-copy memory), how much slower is it compared to transferring something like 200 megabytes? What I would like to know, since I know that transferring over PCI-E is slow for a CUDA kernel, is: does it change anything if I transfer just a single byte or a huge amount of data? Or, since memory transfers are performed in bulk, is transferring a single byte extremely expensive and wasteful compared to transferring 200 MB? I hope this plot explains everything. The data was generated by bandwidthTest in the CUDA samples. The
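The behaviour the question asks about can be sketched with a simple cost model: time ≈ fixed per-transfer latency + size / sustained bandwidth. The constants below are rough illustrative assumptions, not measured values for any particular GPU:

```python
# Illustrative PCIe cost model: every transfer pays a fixed latency,
# plus the payload size divided by sustained bandwidth.
LATENCY_S = 10e-6        # ~10 microseconds per transfer (assumed)
BANDWIDTH_BPS = 6e9      # ~6 GB/s sustained over PCIe (assumed)

def transfer_time(nbytes):
    """Estimated seconds to move nbytes across PCIe in one transfer."""
    return LATENCY_S + nbytes / BANDWIDTH_BPS

one_byte = transfer_time(1)
big = transfer_time(200 * 1024 * 1024)

# A single byte is dominated entirely by latency, so its effective
# bandwidth is tiny; a 200 MB transfer approaches the sustained peak.
print(f"1 byte:  {one_byte * 1e6:.1f} us")
print(f"200 MB:  {big * 1e3:.1f} ms, "
      f"{200 * 1024 * 1024 / big / 1e9:.2f} GB/s effective")
```

This is exactly the shape of the bandwidthTest curve: effective bandwidth climbs with transfer size until the latency term becomes negligible.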

How to run CUDA without a GPU using a software implementation?

北城余情 Submitted on 2019-11-29 04:10:44
My laptop doesn't have an nVidia graphics card, and I want to work on CUDA. The website says that CUDA can be used in emulation mode on non-CUDA hardware too. But when I tried installing the CUDA drivers downloaded from their website, I get the error "The nvidia setup couldn't locate any drivers that are compatible with your current hardware. Setup will now exit". Also, when I tried to run sample code from the SDK in Visual Studio 2008, I get an error that a .obj file is not found. Nils The easiest way to get started with GPU development is to get a cheap GPU (for example a GTX285) and a desktop

GPU shared memory size is very small - what can I do about it?

╄→尐↘猪︶ㄣ Submitted on 2019-11-29 03:08:22
The size of the shared memory ("local memory" in OpenCL terms) is only 16 KiB on most of today's nVIDIA GPUs. I have an application in which I need to create an array of 10,000 integers, so the amount of memory I need is 10,000 * 4 bytes = 40,000 bytes (about 39 KiB). How can I work around this? Is there any GPU that has more than 16 KiB of shared memory? Think of shared memory as an explicitly managed cache. You will need to store your array in global memory and cache parts of it in shared memory as needed, either by making multiple passes or some other scheme which minimises the number of
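The "multiple passes" idea can be sketched host-side with a NumPy stand-in for the CUDA pattern (the full array plays the role of global memory, a 16 KiB chunk plays the role of shared memory; names and the per-tile work are illustrative):

```python
import numpy as np

SHARED_BYTES = 16 * 1024                   # shared memory per block (16 KiB)
TILE = SHARED_BYTES // 4                   # 4096 int32 values fit in one tile

data = np.arange(10_000, dtype=np.int32)   # lives in "global memory"

total = 0
# Process the array tile by tile: each pass stages one chunk into the
# "shared memory" tile, works on it, then moves on to the next chunk.
for start in range(0, data.size, TILE):
    tile = data[start:start + TILE].copy() # global -> shared stage
    total += int(tile.sum())               # work on the cached tile

print(total)                               # same result as summing directly
```

In a real kernel the staging copy would be a cooperative load by the threads of a block, followed by __syncthreads() before the tile is used.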

OpenMP 4.0 in GCC: offload to nVidia GPU

落花浮王杯 Submitted on 2019-11-29 02:42:05
Question: TL;DR - Does GCC (trunk) already support OpenMP 4.0 offloading to nVidia GPUs? If so, what am I doing wrong? (Description below.) I'm running Ubuntu 14.04.2 LTS. I have checked out the most recent GCC trunk (dated 25 Mar 2015). I have installed the CUDA 7.0 toolkit according to the Getting Started on Ubuntu guide. CUDA samples run successfully, i.e. deviceQuery detects my GeForce GT 730. I have followed the instructions from https://gcc.gnu.org/wiki/Offloading as well as https://gcc.gnu.org

ERROR: clGetPlatformIDs -1001 when running OpenCL code (Linux)

百般思念 Submitted on 2019-11-29 02:28:26
After finally managing to get my code to compile with OpenCL, I cannot seem to get the output binary to run! This is on my Linux laptop running Kubuntu 13.10 x64. The error I get (printed from cl::Error) is: ERROR: clGetPlatformIDs -1001. I found this post, but there does not seem to be a clear solution. I added myself to the video group, but this does not seem to work. With regards to the ICD profile, I am not sure what I need to do - shouldn't this be included with the CUDA toolkit? If not, where could I download one? EDIT: It seems I have an ICD file on my system under /usr/share/nvidia-331

Why is my GPU slower than CPU when training LSTM/RNN models?

安稳与你 Submitted on 2019-11-29 02:03:10
Question: My machine has the following spec: CPU: Xeon E5-1620 v4; GPU: Titan X (Pascal); Ubuntu 16.04; Nvidia driver 375.26; CUDA toolkit 8.0; cuDNN 5.1. I've benchmarked the following Keras examples with TensorFlow as the backend:

SCRIPT NAME                  GPU      CPU
stated_lstm.py               5sec     5sec
babi_rnn.py                  10sec    12sec
imdb_bidirectional_lstm.py   240sec   116sec
imbd_lstm.py                 113sec   106sec

My GPU clearly outperforms my CPU in non-LSTM models:

SCRIPT NAME                  GPU      CPU
cifar10_cnn.py               12sec    123sec
imdb_cnn.py                  5sec     119sec

Matrix-vector multiplication in CUDA: benchmarking & performance

天大地大妈咪最大 Submitted on 2019-11-28 20:47:39
I'm updating my question with some new benchmarking results (I also reformulated the question to be more specific and updated the code). I implemented a kernel for matrix-vector multiplication in CUDA C, following the CUDA C Programming Guide, using shared memory. Let me first present some benchmarking results, which I obtained on a Jetson TK1 (GPU: Tegra K1, compute capability 3.2), and a comparison with cuBLAS: Here I guess cuBLAS does some magic, since it seems that its execution is not affected by the number of columns of A, which, in turn, implies that there is some sort of parallelisation
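The shared-memory scheme such a kernel uses can be sketched with a host-side NumPy analogue (tile width and names are illustrative): each pass stages a chunk of the vector x into the fast scratchpad, and every row accumulates its partial dot product from that chunk.

```python
import numpy as np

def matvec_tiled(A, x, tile=64):
    """Matrix-vector product computed in tiles of x, mimicking a CUDA
    kernel that stages chunks of x through shared memory."""
    m, n = A.shape
    y = np.zeros(m)
    # Each iteration corresponds to one shared-memory load plus the
    # partial dot products computed from that tile.
    for start in range(0, n, tile):
        end = min(start + tile, n)
        x_tile = x[start:end]             # the "shared memory" copy of x
        y += A[:, start:end] @ x_tile     # partial dot product per row
    return y

A = np.random.rand(128, 300)
x = np.random.rand(300)
assert np.allclose(matvec_tiled(A, x), A @ x)
```

In the CUDA version each block handles a slice of rows, the threads of a block cooperatively load x_tile, and a __syncthreads() separates the load from the accumulation.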

Running more than one CUDA applications on one GPU

与世无争的帅哥 Submitted on 2019-11-28 17:07:05
Question: The CUDA documentation does not specify how many CUDA processes can share one GPU. For example, if the same user launches more than one CUDA program with only one GPU card installed in the system, what is the effect? Will the correctness of execution be guaranteed? How does the GPU schedule tasks in this case? Answer 1: CUDA activity from independent host processes will normally create independent CUDA contexts, one for each process. Thus, the CUDA activity launched from separate host processes will take

How to perform Hadamard product with CUBLAS on complex numbers?

走远了吗. Submitted on 2019-11-28 14:43:00
I need to compute the element-wise multiplication of two vectors (Hadamard product) of complex numbers with NVidia CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently, you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to achieve it with CUBLAS for complex numbers? I cannot write my own kernel; I have to use CUBLAS (or another standard NVIDIA library if it is really not possible with CUBLAS). talonmies CUBLAS is based on the reference BLAS,
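The SBMV-style workaround rests on the identity diag(a) · b = a ∘ b: multiplying by a diagonal (or banded, with bandwidth zero) matrix built from one vector is exactly the Hadamard product. A NumPy sketch of the identity (in cuBLAS the same trick is available through its dgmm extension, e.g. cublasCdgmm for single-precision complex - verify the exact name against the cuBLAS documentation):

```python
import numpy as np

a = np.array([1 + 2j, 3 - 1j, 0.5 + 0j])
b = np.array([2 - 1j, 1 + 1j, 4 + 4j])

# Hadamard product expressed as a diagonal-matrix multiply:
# diag(a) @ b multiplies each b[i] by a[i], i.e. it equals a * b.
had = np.diag(a) @ b

assert np.allclose(had, a * b)
print(had)
```

On the GPU one would of course never materialise the full diagonal matrix; dgmm-style routines take the vector directly and apply it as an implicit diagonal.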