nvidia

cudaGetCacheConfig takes 0.5 seconds - how/why? [duplicate]

[亡魂溺海] submitted on 2019-11-28 13:41:39
Question: This question already has answers here: slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice (2 answers). Closed 2 years ago. I'm using CUDA 8.0 on a Xeon-based system with a GTX Titan X (GM200). It works fine, but I get long overheads compared to my weaker GTX 600-series card at home. Specifically, when I look at a timeline I find that a call to cudaGetCacheConfig() consistently costs the CUDA runtime API an incredible amount of time: 530-560 ms, or over 0.5 seconds. This, while
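
A minimal sketch (not from the original post) of how the latency of an individual runtime API call like this could be measured with std::chrono; the device index and the cudaFree(0) call used to force context creation beforehand are assumptions:

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    int main() {
        cudaSetDevice(0);   // assumed device index
        cudaFree(0);        // force context creation so it is not billed to the next call

        auto t0 = std::chrono::steady_clock::now();
        cudaFuncCache cfg;
        cudaGetCacheConfig(&cfg);   // the call whose latency is in question
        auto t1 = std::chrono::steady_clock::now();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("cudaGetCacheConfig took %.3f ms\n", ms);
        return 0;
    }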

Error -1001 in clGetPlatformIDs Call !

我与影子孤独终老i submitted on 2019-11-28 13:17:01
I am trying to start working with OpenCL. I have two NVidia graphics cards, and I installed the "developer driver" as well as the SDK from the NVidia website. I compiled the demos, but when I run ./oclDeviceQuery I see: OpenCL SW Info: Error -1001 in clGetPlatformIDs Call !!! How can I fix it? Does it mean my nvidia cards cannot be detected? I am running Ubuntu 10.10 and the X server works properly with the nvidia driver. I am pretty sure the problem is not related to file permissions, as it doesn't work with sudo either. In my case I have solved it by installing the nvidia-modprobe package available in Ubuntu (utopic
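
For reference, a minimal check (a sketch, not from the original question) that queries the platform count and prints the raw error code; an error of -1001 from the ICD loader commonly corresponds to CL_PLATFORM_NOT_FOUND_KHR, i.e. no vendor platform was registered:

    #include <cstdio>
    #include <CL/cl.h>

    int main() {
        cl_uint num_platforms = 0;
        cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);  // only count the platforms
        if (err != CL_SUCCESS) {
            // -1001 usually means the ICD loader found no registered vendor driver,
            // not a bug in the calling code.
            printf("clGetPlatformIDs failed with error %d\n", err);
            return 1;
        }
        printf("Found %u OpenCL platform(s)\n", num_platforms);
        return 0;
    }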

Compile CUDA code for CPU

蓝咒 submitted on 2019-11-28 12:12:06
I'm studying CUDA 5.5, but I don't have any Nvidia GPU. Older versions of nvcc had a flag --multicore to compile CUDA code for the CPU. In the new version of nvcc, what is the option? I'm working on Linux. Robert Crovella: CUDA toolkits since at least CUDA 4.0 have not supported the ability to run CUDA code without a GPU. If you simply want to compile code, refer to this question. If you want to run CUDA code compiled with CUDA 5.5, you will need a CUDA-capable GPU. If you're willing to use older CUDA toolkits, you could install one of the various emulators, such as this one. Or you could install
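
As a quick runtime check for whether a CUDA-capable GPU is actually visible to the toolkit (a minimal sketch, not part of the quoted answer):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);   // fails without a usable device/driver
        if (err != cudaSuccess) {
            printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("Found %d CUDA device(s)\n", count);
        return 0;
    }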

Do I have to use the MPS (MULTI-PROCESS SERVICE) when using CUDA 6.5 + MPI?

只愿长相守 submitted on 2019-11-28 11:08:18
Question: At the following link it is written: https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf 1.1. AT A GLANCE 1.1.1. MPS The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities on the latest NVIDIA (Kepler-based) Tesla and Quadro GPUs. Hyper-Q

How can I use the cooperative groups feature of CUDA in Windows?

感情迁移 submitted on 2019-11-28 10:49:25
Question: My GPU is a GeForce MX150, Pascal architecture, compute capability 6.1, CUDA 9.1, Windows 10. Although my GPU is Pascal, cooperative groups don't work. I want to use them for inter-block synchronization. I found that TCC mode isn't active on my card, and that it can't be activated under WDDM in Windows. How can I use cooperative groups? How can I activate TCC mode in Windows? Thanks for your reply. Answer 1: You can't activate TCC on that GPU (it is not supported), and there is no way to use a cooperative launch under
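
A minimal sketch (not from the original answer) of how one might query at runtime whether the device and driver support cooperative launches, which inter-block grid synchronization requires; the device index is an assumption:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int dev = 0;    // assumed device index
        int coop = 0;
        cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
        printf("Cooperative launch supported: %s\n", coop ? "yes" : "no");
        return 0;
    }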

clEnqueueNDRange blocking on Nvidia hardware? (Also Multi-GPU)

我的未来我决定 submitted on 2019-11-28 10:41:52
Question: On Nvidia GPUs, when I call clEnqueueNDRange, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, CommandQueue::enqueueNDRange, but this shouldn't make a difference. This only happens on Nvidia hardware (3 Tesla M2090s) remotely; on our office workstations with AMD GPUs, the call is non-blocking and returns immediately. I don't have local Nvidia hardware to test on - we used to, and I remember similar behavior then, too, but it's a
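
One way to tell whether the enqueue call itself is blocking is to time the enqueue separately from the explicit wait. A minimal sketch using the underlying C API call clEnqueueNDRangeKernel; the queue, kernel, and work size are assumed to have been created elsewhere:

    #include <cstdio>
    #include <chrono>
    #include <CL/cl.h>

    // Times how long the enqueue call itself takes versus the wait for completion.
    void time_enqueue(cl_command_queue queue, cl_kernel kernel, size_t global_size) {
        using clock = std::chrono::steady_clock;

        auto t0 = clock::now();
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                            &global_size, NULL, 0, NULL, NULL);
        auto t1 = clock::now();
        clFinish(queue);    // explicit wait for the kernel to finish
        auto t2 = clock::now();

        printf("enqueue returned %d after %.3f ms, kernel finished %.3f ms later\n",
               err,
               std::chrono::duration<double, std::milli>(t1 - t0).count(),
               std::chrono::duration<double, std::milli>(t2 - t1).count());
    }

If the first interval is as long as the kernel's run time, the enqueue is effectively blocking on that platform; if it is near zero, only clFinish is waiting.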

CUDA5 Examples: Has anyone translated some cutil definitions to CUDA5?

走远了吗. submitted on 2019-11-28 10:11:07
Has anyone started to work with the CUDA 5 SDK? I have an old project that uses some cutil functions, but they've been abandoned in the new one. The solution was that most functions can be translated from their cutil*/cut* names to a similarly named sdk* equivalent in the helper*.h headers. As an example: cutStartTimer becomes sdkCreateTimer. Just that simple... Has anyone started to work with the CUDA5 SDK? Probably. Has anyone translated some cutil definitions to CUDA5? Maybe. But why not just use the new header files intended to replace it? Quoted from the Beta release notes: Prior to CUDA 5.0, CUDA
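
A minimal sketch of how the replacement timer helpers in helper_timer.h (shipped with the CUDA samples' common headers) are typically used; the include path and the work being timed are assumptions:

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <helper_timer.h>   // from the CUDA samples' common/inc directory

    int main() {
        StopWatchInterface *timer = NULL;
        sdkCreateTimer(&timer);   // analogue of the old cutCreateTimer
        sdkStartTimer(&timer);    // analogue of the old cutStartTimer

        // ... launch and synchronize the work being timed here ...
        cudaDeviceSynchronize();

        sdkStopTimer(&timer);     // analogue of the old cutStopTimer
        printf("elapsed: %.3f ms\n", sdkGetTimerValue(&timer));
        sdkDeleteTimer(&timer);   // analogue of the old cutDeleteTimer
        return 0;
    }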

CUDA compiler not working with GCC 4.5+

我与影子孤独终老i submitted on 2019-11-28 09:57:25
Question: I am new to CUDA, and I am trying to compile this simple test_1.cu file:

    #include <stdio.h>

    __global__ void kernel(void) { }

    int main(void) {
        kernel<<<1,1>>>();
        printf("Hello, World!\n");
        return 0;
    }

using this: nvcc test_1.cu The output I get is:

    In file included from /usr/local/cuda/bin/../include/cuda_runtime.h:59:0,
                     from <command-line>:0:
    /usr/local/cuda/bin/../include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.5 and up are not supported!

my gcc --version: gcc

Using constants with CUDA

谁说我不能喝 submitted on 2019-11-28 09:14:46
Which is the best way of using constants in CUDA? One way is to define constants in constant memory, like:

    // CUDA global constants
    __constant__ int M;

    int main(void)
    {
        ...
        cudaMemcpyToSymbol("M", &M, sizeof(M));
        ...
    }

An alternative way would be to use the C preprocessor:

    #define M = ...

I would think defining constants with the C preprocessor is much faster. What, then, are the benefits of using constant memory on a CUDA device? Robert Crovella: Constants that are known at compile time should be defined using preprocessor macros (e.g. #define) or via C/C++ const variables at global/file
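
As a point of comparison, a minimal sketch of the constant-memory approach in its modern form (the symbol is passed directly rather than as a string; the host-side value and kernel are assumptions added for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    __constant__ int M;    // lives in constant memory on the device

    __global__ void useM(int *out) {
        *out = M * 2;      // every thread reads the same cached constant
    }

    int main() {
        int hostM = 42;    // assumed host-side value
        cudaMemcpyToSymbol(M, &hostM, sizeof(hostM));   // pass the symbol, not the string "M"

        int *d_out, h_out = 0;
        cudaMalloc(&d_out, sizeof(int));
        useM<<<1, 1>>>(d_out);
        cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
        printf("result = %d\n", h_out);   // expect 84
        cudaFree(d_out);
        return 0;
    }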

Are CUDA kernel calls synchronous or asynchronous?

旧街凉风 submitted on 2019-11-28 08:59:17
I read that one can use kernel launches to synchronize different blocks, i.e., if I want all blocks to complete operation 1 before they go on to operation 2, I should place operation 1 in one kernel and operation 2 in another kernel. This way, I can achieve global synchronization between blocks. However, the CUDA C programming guide mentions that kernel calls are asynchronous, i.e., the CPU does not wait for the first kernel call to finish, and thus the CPU can also call the second kernel before the first has finished. However, if this is true, then we cannot use kernel launches to synchronize
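
The distinction is between the host and the device: the launch returns to the CPU immediately, but two kernels issued to the same stream still execute in order on the GPU. A minimal sketch illustrating that ordering (kernel names and sizes are assumptions added for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void operation1(int *data) { data[threadIdx.x] += 1; }
    __global__ void operation2(int *data) { data[threadIdx.x] *= 2; }

    int main() {
        int *d_data;
        cudaMalloc(&d_data, 32 * sizeof(int));
        cudaMemset(d_data, 0, 32 * sizeof(int));

        // Both launches return to the CPU immediately (asynchronous on the host),
        // but on the same (default) stream operation2 will not start on the GPU
        // until every block of operation1 has finished.
        operation1<<<1, 32>>>(d_data);
        operation2<<<1, 32>>>(d_data);

        cudaDeviceSynchronize();   // the host waits here for both kernels
        int h0 = 0;
        cudaMemcpy(&h0, d_data, sizeof(int), cudaMemcpyDeviceToHost);
        printf("data[0] = %d\n", h0);   // expect (0+1)*2 = 2
        cudaFree(d_data);
        return 0;
    }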