GPU

Error creating a GPU instance on Google Cloud

自闭症网瘾萝莉.ら submitted on 2019-12-19 12:49:19
Question: I have tried creating a GPU instance on Google Cloud Platform, but every time I try to create one it shows "You've reached your limit of 0 GPUs NVIDIA K80". I am trying to create an instance with 4 vCPUs, 8-15 GB of memory and 1 GPU in us-east1-c/us-west1-b. Please help with the following.

Answer 1: Follow all the steps in the specified order, because otherwise GPUs won't appear on the Quotas page. You need to go to the Quotas section of IAM & Admin: https://console.cloud.google.com/projectselector/iam

OpenCL reduction result wrong with large floats

我怕爱的太早我们不能终老 submitted on 2019-12-19 11:56:54
Question: I used AMD's two-stage reduction example to compute the sum of all numbers from 0 to 65 536 in single-precision floating point. Unfortunately, the result is not correct. However, when I modify my code so that I compute the sum of 65 536 smaller numbers (for example, all ones), the result is correct. I couldn't find any error in the code. Is it possible that I am getting wrong results because of the float type? If so, what is the best approach to solve the issue?

Answer 1: There is probably no
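
A minimal host-side sketch of the precision ceiling involved (plain C++, independent of the OpenCL kernel from the question): the exact sum 0 + 1 + ... + 65 536 is 2,147,516,416, far beyond 2^24, the largest range in which a 32-bit float can represent every integer, so a float accumulator silently drops low-order bits while a double accumulator does not.

    #include <cstdio>

    int main() {
        float  sum_f = 0.0f;
        double sum_d = 0.0;
        for (long long i = 0; i <= 65536; ++i) {
            sum_f += static_cast<float>(i);   // rounds once the running sum exceeds 2^24
            sum_d += static_cast<double>(i);  // exact for this range
        }
        std::printf("float accumulator : %.1f\n", sum_f);
        std::printf("double accumulator: %.1f\n", sum_d);
        std::printf("exact value       : %lld\n", 65536LL * 65537LL / 2LL);
        return 0;
    }

The usual remedies are to accumulate partial sums in double (or use Kahan summation), or to keep each work-group's partial result small enough that the rounding error stays negligible.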

Tensorflow matmul calculations on GPU are slower than on CPU

杀马特。学长 韩版系。学妹 submitted on 2019-12-19 10:37:12
Question: I'm experimenting with GPU computations for the first time and was, of course, hoping for a big speed-up. However, with a basic example in TensorFlow it was actually worse: on cpu:0 each of the ten runs takes 2 seconds on average, gpu:0 takes 2.7 seconds, and gpu:1 is 50% worse than cpu:0 at 3 seconds. Here's the code:

    import tensorflow as tf
    import numpy as np
    import time
    import random

    for _ in range(10):
        with tf.Session() as sess:
            start = time.time()
            with tf.device('/gpu:0'):  # swap for

OpenCV3: where has cv::cuda::Stream::enqueueUpload() gone?

给你一囗甜甜゛ submitted on 2019-12-19 10:26:09
Question: In former versions of OpenCV there was the function Stream::enqueueUpload, which could be used together with CudaMem to upload data to the GPU asynchronously (compare: how to use gpu::Stream in OpenCV?). However, this function no longer exists in OpenCV 3. The CudaMem class is also gone, but seems to have been replaced by the HostMem class. Can anyone tell me how to perform an asynchronous upload in OpenCV 3?

Answer 1: It can be done now via void GpuMat::upload(InputArray arr, Stream& stream)
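
A minimal sketch of that pattern, assuming an OpenCV 3.x build with CUDA support (the matrix size and fill value are arbitrary placeholders): pinned memory now comes from cv::cuda::HostMem, and the upload overload that takes a Stream is enqueued asynchronously.

    #include <opencv2/core.hpp>
    #include <opencv2/core/cuda.hpp>

    int main() {
        // Page-locked (pinned) host buffer: the OpenCV 3 replacement for CudaMem.
        cv::cuda::HostMem host(480, 640, CV_8UC1, cv::cuda::HostMem::PAGE_LOCKED);
        host.createMatHeader().setTo(cv::Scalar(42));   // fill it on the host

        cv::cuda::Stream stream;
        cv::cuda::GpuMat device;

        device.upload(host, stream);   // enqueued on the stream, returns immediately
        // ... enqueue further GPU work on the same stream here ...
        stream.waitForCompletion();    // block until the upload and queued work finish
        return 0;
    }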

*Modified* Nvidia Maxwell, increased global memory instruction count

二次信任 submitted on 2019-12-19 04:56:45
Question: I ran an experiment on both a GTX 760 (Kepler) and a GTX 750 Ti (Maxwell) using benchmarks (Parboil, Rodinia), then analyzed the results with the NVIDIA Visual Profiler. In most of the applications, the number of global memory instructions increases enormously, up to 7-10 times, on the Maxwell architecture.

Specs for both graphics cards:

    GTX 760     6.0 Gbps   2048 MB   256-bit   192.2 GB/s
    GTX 750 Ti  5.4 Gbps   2048 MB   128-bit    86.4 GB/s
    Ubuntu 14.04, CUDA driver 340.29, toolkit 6.5

I compiled the benchmark application (no modification), then

CUDA Matrix multiplication breaks for large matrices

你离开我真会死。 submitted on 2019-12-19 02:23:11
Question: I have the following matrix multiplication code, implemented using CUDA 3.2 and VS 2008. I am running on Windows Server 2008 R2 Enterprise with an NVIDIA GTX 480. The following code works fine with values of "Width" (matrix width) up to about 2500 or so.

    int size = Width*Width*sizeof(float);
    float* Md, *Nd, *Pd;
    cudaError_t err = cudaSuccess;

    // Allocate device memory for M, N and P
    err = cudaMalloc((void**)&Md, size);
    err = cudaMalloc((void**)&Nd, size);
    err = cudaMalloc((void**)&Pd,
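
The excerpt assigns every cudaMalloc result to err but never inspects it, and without a check after the kernel launch the failure mode at large Width stays invisible. Below is a hedged sketch of the usual checking pattern (the kernel, matrix width and launch configuration are placeholders, not the poster's code); on a desktop Windows GPU, one common reason a previously working kernel starts failing only at large sizes is the display driver watchdog (TDR) aborting a launch that runs longer than roughly two seconds.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print any CUDA error together with the file and line where it was detected.
    #define CUDA_CHECK(call)                                                    \
        do {                                                                    \
            cudaError_t e = (call);                                             \
            if (e != cudaSuccess)                                               \
                std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                             cudaGetErrorString(e));                            \
        } while (0)

    // Placeholder kernel standing in for the poster's matrix multiplication kernel.
    __global__ void clearMatrix(float* p, int width) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < width) p[y * width + x] = 0.0f;
    }

    int main() {
        const int Width = 4096;                               // placeholder size
        const size_t size = (size_t)Width * Width * sizeof(float);
        float* Pd = nullptr;
        CUDA_CHECK(cudaMalloc((void**)&Pd, size));

        dim3 block(16, 16);
        dim3 grid((Width + 15) / 16, (Width + 15) / 16);
        clearMatrix<<<grid, block>>>(Pd, Width);
        CUDA_CHECK(cudaGetLastError());        // launch/configuration errors (bad grid, etc.)
        CUDA_CHECK(cudaDeviceSynchronize());   // execution errors, e.g. a watchdog timeout

        CUDA_CHECK(cudaFree(Pd));
        return 0;
    }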

How to interrupt or cancel a CUDA kernel from host code

北城余情 submitted on 2019-12-19 02:03:21
Question: I am working with CUDA and I am trying to stop my kernel's work (i.e. terminate all running threads) after a certain if block is hit. How can I do that? I am really stuck here.

Answer 1: I assume you want to stop a running kernel (not a single thread). The simplest approach (and the one that I suggest) is to set up a global memory flag which is tested by the kernel. You can set the flag using cudaMemcpy() (or directly, if using unified memory). Like the following:

    if (gm_flag) { _
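
A hedged sketch of that flag technique (the names, sizes and the zero-copy mapping are illustrative choices, not the answer's exact code): the flag lives in mapped, page-locked host memory so the host can flip it while the kernel is still running, and every thread polls it through a volatile pointer and returns early.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void longRunningKernel(volatile int* abortFlag, float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        for (int iter = 0; iter < 1000000; ++iter) {
            if (*abortFlag) return;                  // every thread checks and bails out
            data[i] = data[i] * 0.9999f + 1.0f;      // stand-in for real work
        }
    }

    int main() {
        cudaSetDeviceFlags(cudaDeviceMapHost);       // allow zero-copy host mappings

        int* flagHost = nullptr;
        int* flagDev  = nullptr;
        cudaHostAlloc((void**)&flagHost, sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&flagDev, flagHost, 0);
        *flagHost = 0;

        const int n = 1 << 20;
        float* data = nullptr;
        cudaMalloc((void**)&data, n * sizeof(float));
        cudaMemset(data, 0, n * sizeof(float));

        longRunningKernel<<<(n + 255) / 256, 256>>>(flagDev, data, n);  // asynchronous launch

        *flagHost = 1;                 // host decides to cancel; the kernel sees it and exits
        cudaDeviceSynchronize();       // returns quickly once all threads have bailed out
        std::printf("kernel finished: %s\n", cudaGetErrorString(cudaGetLastError()));

        cudaFree(data);
        cudaFreeHost(flagHost);
        return 0;
    }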

How to use coalesced memory access

六月ゝ 毕业季﹏ submitted on 2019-12-19 00:40:09
Question: I have N threads executing simultaneously on the device, and together they need M*N floats from global memory. What is the correct way to access global memory in a coalesced manner? And how can shared memory help with this?

Answer 1: Usually, good coalesced access is achieved when neighbouring threads access neighbouring cells in memory. So, if tid holds the index of your thread, then accessing:

    arr[tid]    --- gives perfect coalescence
    arr[tid+5]  --- is almost perfect, probably misaligned
    arr[tid
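
A short CUDA sketch of the M*N case from the question (the kernel and variable names are illustrative, not from the original post): the data is laid out so that at every step the N threads touch N consecutive floats, i.e. thread tid reads element i of "its" column as arr[i * N + tid]. The commented-out alternative arr[tid * M + i] would give each thread its own contiguous row, putting neighbouring threads M floats apart, so the accesses would not coalesce.

    __global__ void sumPerThread(const float* __restrict__ arr,
                                 float* __restrict__ out, int M, int N) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= N) return;
        float acc = 0.0f;
        for (int i = 0; i < M; ++i) {
            acc += arr[i * N + tid];      // coalesced: consecutive tids -> consecutive addresses
            // acc += arr[tid * M + i];   // uncoalesced: tids are M floats apart
        }
        out[tid] = acc;
    }

Shared memory helps when the computation itself needs the strided order: a tile can be loaded from global memory in the coalesced pattern above, staged in shared memory, and then read back by the threads in whatever order they actually need.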

Use vivante GPU on IMX6 with 4.14 kernel

帅比萌擦擦* submitted on 2019-12-18 18:32:19
Question: I am working on an IMX6QP with Yocto rocko / Linux 4.14.24 and I am trying to use the GPU. My Yocto configuration file:

    MACHINE ??= 'imx6qp-tx6-emmc'
    DL_DIR ?= "${BSPDIR}/downloads"
    SSTATE_DIR ?= "${BSPDIR}/sstate-cache"
    DISTRO ?= 'karo-minimal'
    PACKAGE_CLASSES ?= "package_deb"
    EXTRA_IMAGE_FEATURES ?= "debug-tweaks"
    VIRTUAL-RUNTIME_init_manager = "sysvinit"
    USER_CLASSES ?= "buildstats image-mklibs image-prelink"
    PATCHRESOLVE = "noop"
    BB_DISKMON_DIRS ??= "\
        STOPTASKS,${TMPDIR},1G,100K \
