nvidia

cudaGetCacheConfig takes 0.5 seconds - how/why? [duplicate]

[亡魂溺海] submitted on 2019-11-28 13:41:39
Question: This question already has answers here: slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice (2 answers). Closed 2 years ago. I'm using CUDA 8.0 on a Xeon-based system with a GTX Titan X (GM200). It works fine, but I get long overheads compared to my weaker GTX 600-series card at home. Specifically, when I look at a timeline I find that a call to cudaGetCacheConfig() consistently costs the CUDA runtime API an incredible amount of time: 530-560 ms, or over 0.5 seconds. This, while
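
A minimal sketch (not from the original post) of how the latency of an individual runtime API call like this could be measured with std::chrono; the device index and the cudaFree(0) call used to force context creation beforehand are assumptions:

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    int main() {
        cudaSetDevice(0);   // assumed device index
        cudaFree(0);        // force context creation so it is not billed to the next call

        auto t0 = std::chrono::steady_clock::now();
        cudaFuncCache cfg;
        cudaGetCacheConfig(&cfg);   // the call whose latency is in question
        auto t1 = std::chrono::steady_clock::now();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("cudaGetCacheConfig took %.3f ms\n", ms);
        return 0;
    }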

Error -1001 in clGetPlatformIDs Call !

我与影子孤独终老i submitted on 2019-11-28 13:17:01
I am trying to start working with OpenCL. I have two NVidia graphics cards, and I installed the "developer driver" as well as the SDK from the NVidia website. I compiled the demos, but when I run ./oclDeviceQuery I see: OpenCL SW Info: Error -1001 in clGetPlatformIDs Call !!! How can I fix it? Does it mean my nvidia cards cannot be detected? I am running Ubuntu 10.10 and the X server works properly with the nvidia driver. I am pretty sure the problem is not related to file permissions, as it doesn't work with sudo either. In my case I have solved it by installing the nvidia-modprobe package available in Ubuntu (utopic
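
For reference, a minimal check (a sketch, not from the original question) that queries the platform count and prints the raw error code; an error of -1001 from the ICD loader commonly corresponds to CL_PLATFORM_NOT_FOUND_KHR, i.e. no vendor platform was registered:

    #include <cstdio>
    #include <CL/cl.h>

    int main() {
        cl_uint num_platforms = 0;
        cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);  // only count the platforms
        if (err != CL_SUCCESS) {
            // -1001 usually means the ICD loader found no registered vendor driver,
            // not a bug in the calling code.
            printf("clGetPlatformIDs failed with error %d\n", err);
            return 1;
        }
        printf("Found %u OpenCL platform(s)\n", num_platforms);
        return 0;
    }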

Compile CUDA code for CPU

蓝咒 submitted on 2019-11-28 12:12:06
I'm studying CUDA 5.5, but I don't have any Nvidia GPU. Older versions of nvcc had a flag --multicore to compile CUDA code for the CPU. In the new version of nvcc, what is the option? I'm working on Linux. Robert Crovella: CUDA toolkits since at least CUDA 4.0 have not supported the ability to run CUDA code without a GPU. If you simply want to compile code, refer to this question. If you want to run CUDA code compiled with CUDA 5.5, you will need a CUDA-capable GPU. If you're willing to use older CUDA toolkits, you could install one of the various emulators, such as this one. Or you could install
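
As a quick runtime check for whether a CUDA-capable GPU is actually visible to the toolkit (a minimal sketch, not part of the quoted answer):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);   // fails without a usable device/driver
        if (err != cudaSuccess) {
            printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("Found %d CUDA device(s)\n", count);
        return 0;
    }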

Do I have to use the MPS (MULTI-PROCESS SERVICE) when using CUDA 6.5 + MPI?

只愿长相守 submitted on 2019-11-28 11:08:18
Question: At the following link it is written: https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf 1.1. AT A GLANCE 1.1.1. MPS The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities on the latest NVIDIA (Kepler-based) Tesla and Quadro GPUs. Hyper-Q

How can I use the cooperative groups feature of CUDA in Windows?

感情迁移 submitted on 2019-11-28 10:49:25
Question: My GPU is a GeForce MX150, Pascal architecture, compute capability 6.1, CUDA 9.1, Windows 10. Although my GPU is Pascal, cooperative groups don't work. I want to use them for inter-block synchronization. I found that TCC mode isn't active on my card, and that it can't be activated under WDDM in Windows. How can I use cooperative groups? How can I activate TCC mode in Windows? Thanks for your reply. Answer 1: You can't activate TCC on that GPU (it is not supported), and there is no way to use a cooperative launch under
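
A minimal sketch (not from the original answer) of how one might query at runtime whether the device and driver support cooperative launches, which inter-block grid synchronization requires; the device index is an assumption:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int dev = 0;    // assumed device index
        int coop = 0;
        cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
        printf("Cooperative launch supported: %s\n", coop ? "yes" : "no");
        return 0;
    }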

clEnqueueNDRange blocking on Nvidia hardware? (Also Multi-GPU)

我的未来我决定 submitted on 2019-11-28 10:41:52
Question: On Nvidia GPUs, when I call clEnqueueNDRange, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, CommandQueue::enqueueNDRange, but this shouldn't make a difference. This only happens on Nvidia hardware (3 Tesla M2090s) remotely; on our office workstations with AMD GPUs, the call is non-blocking and returns immediately. I don't have local Nvidia hardware to test on - we used to, and I remember similar behavior then, too, but it's a
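
One way to tell whether the enqueue call itself is blocking is to time the enqueue separately from the explicit wait. A minimal sketch using the underlying C API call clEnqueueNDRangeKernel; the queue, kernel, and work size are assumed to have been created elsewhere:

    #include <cstdio>
    #include <chrono>
    #include <CL/cl.h>

    // Times how long the enqueue call itself takes versus the wait for completion.
    void time_enqueue(cl_command_queue queue, cl_kernel kernel, size_t global_size) {
        using clock = std::chrono::steady_clock;

        auto t0 = clock::now();
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                            &global_size, NULL, 0, NULL, NULL);
        auto t1 = clock::now();
        clFinish(queue);    // explicit wait for the kernel to finish
        auto t2 = clock::now();

        printf("enqueue returned %d after %.3f ms, kernel finished %.3f ms later\n",
               err,
               std::chrono::duration<double, std::milli>(t1 - t0).count(),
               std::chrono::duration<double, std::milli>(t2 - t1).count());
    }

If the first interval is as long as the kernel's run time, the enqueue is effectively blocking on that platform; if it is near zero, only clFinish is waiting.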

CUDA5 Examples: Has anyone translated some cutil definitions to CUDA5?

走远了吗. submitted on 2019-11-28 10:11:07
Has anyone started to work with the CUDA 5 SDK? I have an old project that uses some cutil functions, but they've been abandoned in the new one. The solution was that most functions can be translated from their cutil*/cut* names to a similarly named sdk* equivalent in the helper*.h headers. As an example: cutStartTimer becomes sdkCreateTimer. Just that simple... Has anyone started to work with the CUDA5 SDK? Probably. Has anyone translated some cutil definitions to CUDA5? Maybe. But why not just use the new header files intended to replace it? Quoted from the Beta release notes: Prior to CUDA 5.0, CUDA
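
A minimal sketch of how the replacement timer helpers in helper_timer.h (shipped with the CUDA samples' common headers) are typically used; the include path and the work being timed are assumptions:

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <helper_timer.h>   // from the CUDA samples' common/inc directory

    int main() {
        StopWatchInterface *timer = NULL;
        sdkCreateTimer(&timer);   // analogue of the old cutCreateTimer
        sdkStartTimer(&timer);    // analogue of the old cutStartTimer

        // ... launch and synchronize the work being timed here ...
        cudaDeviceSynchronize();

        sdkStopTimer(&timer);     // analogue of the old cutStopTimer
        printf("elapsed: %.3f ms\n", sdkGetTimerValue(&timer));
        sdkDeleteTimer(&timer);   // analogue of the old cutDeleteTimer
        return 0;
    }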

CUDA compiler not working with GCC 4.5+

我与影子孤独终老i submitted on 2019-11-28 09:57:25
Question: I am new to CUDA, and I am trying to compile this simple test_1.cu file:

    #include <stdio.h>

    __global__ void kernel(void) { }

    int main(void) {
        kernel<<<1,1>>>();
        printf("Hello, World!\n");
        return 0;
    }

using this: nvcc test_1.cu The output I get is:

    In file included from /usr/local/cuda/bin/../include/cuda_runtime.h:59:0,
                     from <command-line>:0:
    /usr/local/cuda/bin/../include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.5 and up are not supported!

my gcc --version: gcc

Using constants with CUDA

谁说我不能喝 submitted on 2019-11-28 09:14:46
Which is the best way of using constants in CUDA? One way is to define constants in constant memory, like:

    // CUDA global constants
    __constant__ int M;

    int main(void)
    {
        ...
        cudaMemcpyToSymbol("M", &M, sizeof(M));
        ...
    }

An alternative way would be to use the C preprocessor:

    #define M = ...

I would think defining constants with the C preprocessor is much faster. What, then, are the benefits of using constant memory on a CUDA device? Robert Crovella: Constants that are known at compile time should be defined using preprocessor macros (e.g. #define) or via C/C++ const variables at global/file
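
As a point of comparison, a minimal sketch of the constant-memory approach in its modern form (the symbol is passed directly rather than as a string; the host-side value and kernel are assumptions added for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    __constant__ int M;    // lives in constant memory on the device

    __global__ void useM(int *out) {
        *out = M * 2;      // every thread reads the same cached constant
    }

    int main() {
        int hostM = 42;    // assumed host-side value
        cudaMemcpyToSymbol(M, &hostM, sizeof(hostM));   // pass the symbol, not the string "M"

        int *d_out, h_out = 0;
        cudaMalloc(&d_out, sizeof(int));
        useM<<<1, 1>>>(d_out);
        cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
        printf("result = %d\n", h_out);   // expect 84
        cudaFree(d_out);
        return 0;
    }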

Are CUDA kernel calls synchronous or asynchronous?

旧街凉风 submitted on 2019-11-28 08:59:17
I read that one can use kernel launches to synchronize different blocks, i.e., if I want all blocks to complete operation 1 before they go on to operation 2, I should place operation 1 in one kernel and operation 2 in another kernel. This way, I can achieve global synchronization between blocks. However, the CUDA C programming guide mentions that kernel calls are asynchronous, i.e., the CPU does not wait for the first kernel call to finish, and thus the CPU can also call the second kernel before the first has finished. However, if this is true, then we cannot use kernel launches to synchronize
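
The distinction is between the host and the device: the launch returns to the CPU immediately, but two kernels issued to the same stream still execute in order on the GPU. A minimal sketch illustrating that ordering (kernel names and sizes are assumptions added for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void operation1(int *data) { data[threadIdx.x] += 1; }
    __global__ void operation2(int *data) { data[threadIdx.x] *= 2; }

    int main() {
        int *d_data;
        cudaMalloc(&d_data, 32 * sizeof(int));
        cudaMemset(d_data, 0, 32 * sizeof(int));

        // Both launches return to the CPU immediately (asynchronous on the host),
        // but on the same (default) stream operation2 will not start on the GPU
        // until every block of operation1 has finished.
        operation1<<<1, 32>>>(d_data);
        operation2<<<1, 32>>>(d_data);

        cudaDeviceSynchronize();   // the host waits here for both kernels
        int h0 = 0;
        cudaMemcpy(&h0, d_data, sizeof(int), cudaMemcpyDeviceToHost);
        printf("data[0] = %d\n", h0);   // expect (0+1)*2 = 2
        cudaFree(d_data);
        return 0;
    }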