nvidia

How is CUDA memory managed?

Submitted by 自古美人都是妖i on 2019-11-26 17:26:56
When I run my CUDA program, which allocates only a small amount of global memory (below 20 MB), I get an "out of memory" error. (From other people's posts, I suspect the problem is related to memory fragmentation.) I am trying to understand this, and I realize I have a couple of questions about CUDA memory management. Is there a virtual memory concept in CUDA? If only one kernel is allowed to run on the GPU at a time, will all of the memory it used or allocated be released after it terminates? If not, when is this memory freed? If more than one kernel is allowed to run on CUDA, how…
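A quick way to see what the runtime is doing with the device heap is to query free/total memory around an allocation. The sketch below is illustrative only (nothing in it comes from the original post beyond the ~20 MB figure); cudaMemGetInfo, cudaMalloc, and cudaFree are standard runtime API calls.

    // Diagnostic sketch: watch free/total device memory around an allocation.
    #include <cstdio>
    #include <cuda_runtime.h>

    static void printMemInfo(const char *tag) {
        size_t freeB = 0, totalB = 0;
        cudaMemGetInfo(&freeB, &totalB);      // both values are in bytes
        printf("%s: free = %zu MB, total = %zu MB\n",
               tag, freeB >> 20, totalB >> 20);
    }

    int main() {
        printMemInfo("before");
        void *p = nullptr;
        cudaMalloc(&p, 20 << 20);             // 20 MB, as in the question
        printMemInfo("after cudaMalloc");
        cudaFree(p);                          // returns the block to the CUDA heap
        printMemInfo("after cudaFree");
        return 0;
    }

Running this around suspect allocations makes fragmentation visible: plenty of free bytes in total, yet cudaMalloc failing for a single contiguous block.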

CUFFT error handling

Submitted by 荒凉一梦 on 2019-11-26 17:10:00
Question: I'm using the following macro for CUFFT error handling:

    #define cufftSafeCall(err) __cufftSafeCall(err, __FILE__, __LINE__)

    inline void __cufftSafeCall(cufftResult err, const char *file, const int line)
    {
        if (CUFFT_SUCCESS != err) {
            fprintf(stderr, "cufftSafeCall() CUFFT error in file <%s>, line %i.\n",
                    file, line);
            getch();
            exit(-1);
        }
    }

This macro does not print the message string corresponding to the error code. The book "CUDA Programming: A Developer's Guide to Parallel Computing with GPUs" suggests…
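Since these CUFFT versions ship no error-string function (unlike the runtime API's cudaGetErrorString), the usual workaround is a hand-written switch over cufftResult. The helper below is a sketch under that assumption, and its name _cufftGetErrorString is invented here for illustration:

    #include <cufft.h>

    // Map a cufftResult code to a printable name (hand-maintained list).
    static const char *_cufftGetErrorString(cufftResult err) {
        switch (err) {
            case CUFFT_SUCCESS:        return "CUFFT_SUCCESS";
            case CUFFT_INVALID_PLAN:   return "CUFFT_INVALID_PLAN";
            case CUFFT_ALLOC_FAILED:   return "CUFFT_ALLOC_FAILED";
            case CUFFT_INVALID_TYPE:   return "CUFFT_INVALID_TYPE";
            case CUFFT_INVALID_VALUE:  return "CUFFT_INVALID_VALUE";
            case CUFFT_INTERNAL_ERROR: return "CUFFT_INTERNAL_ERROR";
            case CUFFT_EXEC_FAILED:    return "CUFFT_EXEC_FAILED";
            case CUFFT_SETUP_FAILED:   return "CUFFT_SETUP_FAILED";
            case CUFFT_INVALID_SIZE:   return "CUFFT_INVALID_SIZE";
            case CUFFT_UNALIGNED_DATA: return "CUFFT_UNALIGNED_DATA";
            default:                   return "<unknown cufftResult>";
        }
    }

The macro's fprintf can then include _cufftGetErrorString(err) next to the file and line.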

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

Submitted by 社会主义新天地 on 2019-11-26 16:29:46
Question: Can I run non-MPI CUDA applications concurrently on NVIDIA Kepler GPUs with MPS? I'd like to do this because my applications cannot fully utilize the GPU, so I want them to co-run. Is there any code example of how to do this?

Answer 1: The necessary instructions are contained in the documentation for the MPS service. You'll note that those instructions don't really depend on or call out MPI, so there is nothing MPI-specific about them. Here's a walkthrough/example. Read section 2.3 of…

Forcing NVIDIA GPU programmatically in Optimus laptops

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-11-26 15:46:45
Question: I'm programming a DirectX game, and when I run it on an Optimus laptop the Intel GPU is used, resulting in horrible performance. If I force the NVIDIA GPU via the context menu, or by renaming my executable to bf3.exe or some other famous game's executable name, performance is as expected. Obviously neither is an acceptable solution when I have to redistribute my game, so is there a way to programmatically force the laptop to use the NVIDIA GPU? I've already tried using DirectX to enumerate…
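The widely documented answer for NVIDIA hardware is to export a global variable from the executable; the Optimus driver checks for it at load time and routes the process to the discrete GPU. A minimal MSVC-style sketch (the AMD counterpart, AmdPowerXpressRequestHighPerformance, works the same way):

    // Exported hint read by the NVIDIA Optimus driver: the value 0x00000001
    // requests the high-performance (discrete) GPU for this process.
    extern "C" {
        __declspec(dllexport) unsigned long NvOptimusEnablement = 0x00000001;
    }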

CUDA determining threads per block, blocks per grid

Submitted by 為{幸葍}努か on 2019-11-26 15:13:37
Question: I'm new to the CUDA paradigm. My question is about determining the number of threads per block and blocks per grid. Does a bit of art and trial and error play into this? What I've found is that many examples use seemingly arbitrary numbers for these things. I'm considering a problem where I would be able to pass matrices of any size to a method for multiplication, so that each element of C (as in C = A * B) would be calculated by a single thread. How would you determine the threads/block,…
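For the one-thread-per-element scheme the question describes, the usual recipe is to pick a fixed block shape and round the grid up with a ceiling division, guarding out-of-range threads inside the kernel. The sketch below assumes 16x16 blocks (256 threads) and row-major storage; the names are illustrative, not from the question:

    #include <cuda_runtime.h>

    // Naive one-thread-per-element multiply: C (MxN) = A (MxK) * B (KxN).
    __global__ void matMulKernel(const float *A, const float *B, float *C,
                                 int M, int N, int K) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < M && col < N) {            // guard: the grid may overshoot C
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[row * K + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    }

    void launchMatMul(const float *dA, const float *dB, float *dC,
                      int M, int N, int K) {
        const int TILE = 16;                 // 16*16 = 256 threads per block
        dim3 block(TILE, TILE);
        dim3 grid((N + TILE - 1) / TILE,     // ceiling division over columns
                  (M + TILE - 1) / TILE);    // ...and over rows
        matMulKernel<<<grid, block>>>(dA, dB, dC, M, N, K);
    }

The 16x16 choice is a reasonable default rather than a law; multiples of the 32-thread warp size, typically 128 to 512 threads per block, are the usual starting points to tune from.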

CUDA program causes nvidia driver to crash

Submitted by 半城伤御伤魂 on 2019-11-26 14:33:26
Question: My Monte Carlo pi-calculation CUDA program causes my NVIDIA driver to crash when I exceed around 500 trials and 256 full blocks. It seems to be happening in the monteCarlo kernel function. Any help is appreciated.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>
    #include <curand.h>
    #include <curand_kernel.h>

    #define NUM_THREAD 256
    #define NUM_BLOCK 256
    …
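Whatever the root cause turns out to be, the first step with a driver crash is to capture the error the runtime reports; a display-driver watchdog reset, for instance, surfaces as a "the launch timed out" error after synchronization. A debugging sketch (the kernel here is a hypothetical stand-in, not the poster's):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical stand-in for the question's monteCarlo kernel.
    __global__ void monteCarlo(unsigned long long *hits, int trials) { }

    int main() {
        unsigned long long *dHits = nullptr;
        cudaMalloc(&dHits, sizeof(*dHits));
        monteCarlo<<<256, 256>>>(dHits, 500);
        cudaError_t err = cudaGetLastError();   // catches bad launch configurations
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();      // catches faults during execution
        if (err != cudaSuccess)
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(dHits);
        return 0;
    }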

Why are NVIDIA Pascal GPUs slow at running CUDA kernels when using cudaMallocManaged?

Submitted by 本秂侑毒 on 2019-11-26 14:16:50
Question: I was testing the new CUDA 8 along with a Pascal Titan X GPU, expecting a speedup for my code, but for some reason it ends up being slower. I am on Ubuntu 16.04. Here is the minimal code that reproduces the result:

CUDASample.cuh

    class CUDASample {
    public:
        void AddOneToVector(std::vector<int> &in);
    };

CUDASample.cu

    __global__ static void CUDAKernelAddOneToVector(int *data)
    {
        const int x = blockIdx.x * blockDim.x + threadIdx.x;
        const int y = blockIdx.y * blockDim.y + threadIdx.y;
        const…
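A common explanation for this exact symptom is that on Pascal, cudaMallocManaged memory is no longer bulk-copied before the launch; pages migrate to the GPU on demand, and the kernel pays for the page faults. A minimal sketch of the usual remedy, prefetching with cudaMemPrefetchAsync (added in CUDA 8), follows; the kernel and sizes are illustrative, not the poster's:

    #include <cuda_runtime.h>

    __global__ void addOne(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    void runPrefetched(int n) {
        int *data = nullptr;
        size_t bytes = n * sizeof(int);
        cudaMallocManaged(&data, bytes);
        int device = 0;
        cudaGetDevice(&device);
        cudaMemPrefetchAsync(data, bytes, device);          // pages to the GPU up front
        addOne<<<(n + 255) / 256, 256>>>(data, n);
        cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId); // pages back for host reads
        cudaDeviceSynchronize();
        cudaFree(data);
    }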

What is the correct version of CUDA for my NVIDIA driver?

Submitted by 荒凉一梦 on 2019-11-26 12:14:59
I am using Ubuntu 14.04. I want to install CUDA, but I don't know which version is right for my laptop. I checked my driver version:

    $ cat /proc/driver/nvidia/version
    NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.125 Mon Dec 1 19:58:28 PST 2014
    GCC version: gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)

I tried to install CUDA cuda-linux64-rel-7.0.28-19326674, but when I test it with ./deviceQuery I get:

    ./deviceQuery Starting...
    CUDA Device Query (Runtime API) version (CUDART static linking)
    cudaGetDeviceCount returned 35
    -> CUDA driver version is insufficient for CUDA runtime version
    Result = FAIL
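Error 35 (cudaErrorInsufficientDriver) is exactly this mismatch: the installed runtime (CUDA 7.0) is newer than what the 304.125 driver supports. One way to see both numbers from code is the standard version-query calls; a small diagnostic sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int driverV = 0, runtimeV = 0;
        cudaDriverGetVersion(&driverV);    // highest CUDA version the driver supports
        cudaRuntimeGetVersion(&runtimeV);  // CUDA version of the linked runtime
        printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
               driverV / 1000, (driverV % 1000) / 10,
               runtimeV / 1000, (runtimeV % 1000) / 10);
        return 0;
    }

If the first number is lower than the second, the fix is either a newer driver or an older CUDA toolkit.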

CUDA kernel returning vectors

Submitted by 爱⌒轻易说出口 on 2019-11-26 11:41:46
Question: I have a list of words, and my goal is to match each word in a very, very long phrase. I'm having no problem matching each word; my only problem is returning a vector of structures containing information about each match. In code:

    typedef struct {
        int A, B, C;
    } Match;

    __global__ void Find(veryLongPhrase *_phrase, Words *_word_list, vector<Match> *_matches)
    {
        int a, b, c;

        [...]  // parallel search for each word in the phrase

        if (match)  // when an occurrence is found
        {
            _matches.push_back(new…
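std::vector is not usable inside __global__ code, which is why this approach fails; the conventional device-side substitute is a pre-allocated output array plus a counter bumped with atomicAdd. A hedged sketch of that pattern (the matching logic is elided, and the payload written is purely illustrative):

    struct Match { int A, B, C; };

    __global__ void find(const char *phrase, int phraseLen,
                         Match *matches, int *matchCount, int maxMatches) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        bool matched = /* ... word-comparison logic elided ... */ false;
        if (matched) {
            int slot = atomicAdd(matchCount, 1);  // claim a unique output slot
            if (slot < maxMatches)
                matches[slot] = Match{i, 0, 0};   // illustrative payload
        }
    }

After the launch, the host copies *matchCount back, clamps it to maxMatches, and copies that many Match records.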

OpenGL without X.org in linux

Submitted by 余生颓废 on 2019-11-26 10:26:08
Question: I'd like to open an OpenGL context without X in Linux. Is there any way at all to do it? I know it's possible on integrated Intel graphics hardware, but most people have NVIDIA cards in their systems, so I'd like a solution that works with NVIDIA cards. If there's no other way than through integrated Intel hardware, I guess it would be okay to know how it's done there. The X11 protocol itself is too large and complex, and the mouse/keyboard/tablet input multiplexing it provides is too…
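For NVIDIA hardware, the X-free path that eventually shipped is EGL; the rough sketch below assumes the driver exposes EGL_KHR_surfaceless_context (and, on truly display-less machines, EGL_EXT_platform_device), support that postdates the original question:

    #include <EGL/egl.h>

    int main() {
        EGLDisplay dpy = eglGetDisplay(EGL_DEFAULT_DISPLAY);
        eglInitialize(dpy, nullptr, nullptr);
        EGLint cfgAttrs[] = { EGL_RENDERABLE_TYPE, EGL_OPENGL_BIT, EGL_NONE };
        EGLConfig cfg; EGLint n = 0;
        eglChooseConfig(dpy, cfgAttrs, &cfg, 1, &n);
        eglBindAPI(EGL_OPENGL_API);                       // desktop GL, not GLES
        EGLContext ctx = eglCreateContext(dpy, cfg, EGL_NO_CONTEXT, nullptr);
        eglMakeCurrent(dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, ctx);  // surfaceless
        /* ... render into a framebuffer object here ... */
        eglTerminate(dpy);
        return 0;
    }

On a machine with no display server at all, the display would instead come from eglGetPlatformDisplayEXT with EGL_PLATFORM_DEVICE_EXT; the rest stays the same.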