nvidia

CUDA - how much slower is transferring over PCI-E?

十年热恋 posted on 2019-11-27 18:52:33
Question: If I transfer a single byte from a CUDA kernel over PCI-E to the host (zero-copy memory), how much slower is it compared to transferring something like 200 megabytes? What I would like to know, since I know that transferring over PCI-E is slow for a CUDA kernel, is: does it change anything if I transfer just a single byte or a huge amount of data? Or perhaps, since memory transfers are performed in "bulks", is transferring a single byte extremely expensive and useless with respect to transferring
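For a sense of scale, here is a minimal timing sketch (buffer sizes are illustrative, assuming a CUDA-capable device): every transfer pays a roughly fixed per-transfer latency, so a 1-byte copy costs almost as much as a small bulk copy, while a 200 MB copy is dominated by PCI-E bandwidth.

#include <cstdio>
#include <cuda_runtime.h>

// Times a host-to-device copy of `bytes` bytes using CUDA events.
// Tiny transfers are latency-dominated; large ones are bandwidth-bound.
static void timeCopy(size_t bytes) {
    void *h, *d;
    cudaMallocHost(&h, bytes);          // pinned host memory
    cudaMalloc(&d, bytes);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%zu bytes: %.4f ms\n", bytes, ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    cudaFreeHost(h);
}

int main() {
    timeCopy(1);                        // latency-dominated
    timeCopy(200ull * 1024 * 1024);     // bandwidth-dominated (200 MB)
    return 0;
}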

How to run CUDA without a GPU using a software implementation?

落爺英雄遲暮 posted on 2019-11-27 18:03:38
Question: My laptop doesn't have an NVIDIA graphics card, and I want to work on CUDA. The website says that CUDA can be used in emulation mode on non-CUDA hardware too. But when I tried installing the CUDA drivers downloaded from their website, it gave an error: "The nvidia setup couldn't locate any drivers that are compatible with your current hardware. Setup will now exit". Also, when I tried to run sample codes from the SDK in Visual Studio 2008, I got an error that a .obj file is not found. Answer 1: The

Streaming multiprocessors, Blocks and Threads (CUDA)

烂漫一生 posted on 2019-11-27 16:52:48
What is the relationship between a CUDA core, a streaming multiprocessor, and the CUDA model of blocks and threads? What gets mapped to what, what is parallelized, and how? And which is more efficient: maximizing the number of blocks or the number of threads? My current understanding is that there are 8 CUDA cores per multiprocessor, that every CUDA core is able to execute one CUDA block at a time, and that all the threads in that block are executed serially in that particular core. Is this correct? Edric: The thread / block layout is described in detail in the CUDA programming guide. In
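For orientation, a minimal sketch of how the hierarchy appears in code (names and sizes illustrative): each block is scheduled onto one streaming multiprocessor, and the threads within a block execute in warps of 32 on that SM's CUDA cores, not serially on a single core.

__global__ void scale(float *v, float s, int n) {
    // One thread per element; blocks are distributed across SMs,
    // and threads within a block run in warps of 32 on that SM.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

// Launch: 256 threads per block, enough blocks to cover n elements.
// scale<<<(n + 255) / 256, 256>>>(d_v, 2.0f, n);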

ERROR: clGetPlatformIDs -1001 when running OpenCL code (Linux)

主宰稳场 posted on 2019-11-27 16:46:47
Question: After finally managing to get my code to compile with OpenCL, I cannot seem to get the output binary to run! This is on my Linux laptop running Kubuntu 13.10 x64. The error I get (printed from cl::Error) is: ERROR: clGetPlatformIDs -1001. I found this post, but there does not seem to be a clear solution. I added myself to the video group, but this does not seem to work. With regards to the ICD profile... I am not sure what I need to do - shouldn't this be included with the cuda toolkit? If not,
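For context, error code -1001 is CL_PLATFORM_NOT_FOUND_KHR: the ICD loader found no vendor ICD file (typically under /etc/OpenCL/vendors). A minimal check sketch:

#include <cstdio>
#include <CL/cl.h>

// Prints how many OpenCL platforms the ICD loader can find.
// A return of -1001 (CL_PLATFORM_NOT_FOUND_KHR) means no vendor ICD
// file was found, typically under /etc/OpenCL/vendors.
int main() {
    cl_uint n = 0;
    cl_int err = clGetPlatformIDs(0, NULL, &n);
    printf("clGetPlatformIDs returned %d, %u platform(s) found\n", err, n);
    return 0;
}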

Number of Compute Units corresponding to the number of work groups

喜你入骨 posted on 2019-11-27 14:06:56
Question: I need some clarification. I'm developing OpenCL on my laptop running a small NVIDIA GPU (310M). When I query the device for CL_DEVICE_MAX_COMPUTE_UNITS, the result is 2. I read that the number of work groups for running a kernel should correspond to the number of compute units (Heterogeneous Computing with OpenCL, Chapter 9, p. 186), otherwise it would waste too much global memory bandwidth. Also, the chip is specified to have 16 CUDA cores (which correspond to PEs, I believe). Does that mean
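As a reference point, here is a minimal query sketch (error checking omitted for brevity): on NVIDIA hardware an OpenCL compute unit maps to a streaming multiprocessor, each of which contains multiple CUDA cores (processing elements).

#include <cstdio>
#include <CL/cl.h>

// Queries the first GPU of the first platform for its compute unit count.
int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_uint cus = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);
    printf("CL_DEVICE_MAX_COMPUTE_UNITS = %u\n", cus);
    return 0;
}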

Forcing NVIDIA GPU programmatically in Optimus laptops

↘锁芯ラ posted on 2019-11-27 11:49:34
I'm programming a DirectX game, and when I run it on an Optimus laptop the Intel GPU is used, resulting in horrible performance. If I force the NVIDIA GPU using the context menu or by renaming my executable to bf3.exe or some other famous game executable name, performance is as expected. Obviously neither is an acceptable solution for when I have to redistribute my game, so is there a way to programmatically force the laptop to use the NVIDIA GPU? I've already tried using DirectX to enumerate adapters (IDirect3D9::GetAdapterCount, IDirect3D9::GetAdapterIdentifier) and it doesn't work: only 1
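One widely cited approach, documented by NVIDIA for Optimus drivers 302 and newer, is to export the NvOptimusEnablement symbol from the executable, which hints the driver to prefer the discrete GPU for this process:

#include <windows.h>

// Exporting this symbol from the .exe tells the Optimus driver (302+)
// to render this application on the discrete NVIDIA GPU.
extern "C" {
    __declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;
}

The symbol must live in the executable itself, not in a DLL, for the driver to pick it up.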

CUDA determining threads per block, blocks per grid

空扰寡人 posted on 2019-11-27 10:33:49
I'm new to the CUDA paradigm. My question is about determining the number of threads per block and blocks per grid. Does a bit of art and trial and error play into this? What I've found is that many examples have seemingly arbitrary numbers chosen for these things. I'm considering a problem where I would be able to pass matrices - of any size - to a method for multiplication, so that each element of C (as in C = A * B) would be calculated by a single thread. How would you determine the threads/block and blocks/grid in this case? In general you want to size your blocks/grid to match your data and
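A common pattern for the one-thread-per-element case is a fixed block shape plus ceiling division to cover any matrix size; a sketch with illustrative names (widthC, heightC, matMul):

// 16x16 thread blocks; enough blocks to cover every element of C.
dim3 block(16, 16);
dim3 grid((widthC + block.x - 1) / block.x,    // ceiling division
          (heightC + block.y - 1) / block.y);
// matMul<<<grid, block>>>(dA, dB, dC, widthC, heightC);
// Threads past the matrix edge must bounds-check and return early.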

CUDA program causes nvidia driver to crash

为君一笑 posted on 2019-11-27 09:09:49
My Monte Carlo pi calculation CUDA program is causing my NVIDIA driver to crash when I exceed around 500 trials and 256 full blocks. It seems to be happening in the monteCarlo kernel function. Any help is appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>

#define NUM_THREAD 256
#define NUM_BLOCK 256

///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
// Function to sum an array
__global__ void
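For context, a kernel that runs longer than a few seconds on a GPU that also drives the display commonly trips the OS watchdog timer, which resets the driver; checking launch errors makes this visible. A minimal error-checking sketch (the kernel arguments are elided in the question, so they are left as a placeholder here):

// After the launch, check for errors and synchronize; a watchdog reset
// typically reports "the launch timed out and was terminated".
monteCarlo<<<NUM_BLOCK, NUM_THREAD>>>(/* args from the question */);
cudaError_t err = cudaGetLastError();
if (err == cudaSuccess) err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));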

Why are NVIDIA Pascal GPUs slow at running CUDA kernels when using cudaMallocManaged?

柔情痞子 posted on 2019-11-27 08:44:19
I was testing the new CUDA 8 along with the Pascal Titan X GPU, expecting a speedup for my code, but for some reason it ends up being slower. I am on Ubuntu 16.04. Here is the minimal code that can reproduce the result:

CUDASample.cuh

class CUDASample {
public:
    void AddOneToVector(std::vector<int> &in);
};

CUDASample.cu

__global__ static void CUDAKernelAddOneToVector(int *data) {
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const int mx = gridDim.x * blockDim.x;
    data[y * mx + x] = data[y * mx + x] + 1.0f;
}

void CUDASample:
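For context, the commonly cited explanation for this slowdown is that Pascal introduces on-demand page migration for managed memory, so the kernel's first touch of each page pays a fault cost; prefetching the buffer to the device before the launch avoids it. A minimal sketch, assuming CUDA 8+ and illustrative names (data, bytes):

// Move the managed allocation to the GPU before the kernel runs, so the
// kernel does not page-fault its way through the buffer (Pascal and newer).
int device = 0;
cudaGetDevice(&device);
cudaMemPrefetchAsync(data, bytes, device, 0);            // host -> GPU
// CUDAKernelAddOneToVector<<<grid, block>>>(data);
cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);   // GPU -> host
cudaDeviceSynchronize();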

How does CUDA assign device IDs to GPUs?

*爱你&永不变心* posted on 2019-11-27 08:36:10
When a computer has multiple CUDA-capable GPUs, each GPU is assigned a device ID. By default, CUDA kernels execute on device ID 0. You can use cudaSetDevice(int device) to select a different device. Let's say I have two GPUs in my machine: a GTX 480 and a GTX 670. How does CUDA decide which GPU is device ID 0 and which GPU is device ID 1? Ideas for how CUDA might assign device IDs (just brainstorming):

- descending order of compute capability
- PCI slot number
- date/time when the device was added to the system (a device that was just added to the computer gets a higher ID number)

Motivation: I'm working on
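For reference, the runtime's default enumeration order is "fastest first"; newer toolkits expose the CUDA_DEVICE_ORDER environment variable, whose PCI_BUS_ID setting orders devices by bus location instead. A minimal sketch that makes the assignment visible:

#include <cstdio>
#include <cuda_runtime.h>

// Lists every visible device with its ID, name, and PCI location, so the
// runtime's device-ID assignment can be observed directly.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s (PCI bus %d, device %d)\n",
               i, prop.name, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}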