nvidia

Cannot compile OpenCL application using 1.2 headers in 1.1 version

◇◆丶佛笑我妖孽 submitted on 2019-11-30 09:24:29
I'm writing a small hello-world OpenCL program using the Khronos Group's cl.hpp for OpenCL 1.2 and NVIDIA's OpenCL libraries. The drivers and ICD I have support OpenCL 1.1. Since the NVIDIA side doesn't support 1.2 yet, I get errors on the functions OpenCL 1.2 requires. On the other hand, cl.hpp for OpenCL 1.2 has a flag, CL_VERSION_1_1 to be exact, to run the header in 1.1 mode, but it's not working. Has anybody had a similar experience or found a solution? Note: cl.hpp for version 1.1 works, but it generates many warnings during compilation, which is why I'm trying to use the 1.2 version. Unfortunately NVIDIA
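
A common workaround (an assumption here, not a confirmed fix for this exact setup) is to define CL_USE_DEPRECATED_OPENCL_1_1_APIS before any OpenCL include: CL_VERSION_1_1 is defined by cl.h itself, so defining it manually does not downgrade the header. A minimal sketch:

```cpp
// Sketch of the usual workaround (an assumption, not a verified fix):
// define the deprecation macro *before* any OpenCL header so the 1.2
// header keeps exposing the 1.1 entry points it would otherwise hide.
#define CL_USE_DEPRECATED_OPENCL_1_1_APIS
#include <CL/cl.hpp>
#include <iostream>
#include <vector>

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);  // available in both 1.1 and 1.2
    for (const auto& p : platforms)
        std::cout << p.getInfo<CL_PLATFORM_VERSION>() << "\n";
    return 0;
}
```

Even with the macro defined, any call that only exists in OpenCL 1.2 would still fail at link time against a 1.1 ICD, so the sketch sticks to 1.1-safe calls.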

Does AMD's OpenCL offer something similar to CUDA's GPUDirect?

点点圈 submitted on 2019-11-30 07:06:20
NVIDIA offers GPUDirect to reduce memory transfer overheads. I'm wondering if there is a similar concept for AMD/ATI? Specifically: 1) Do AMD GPUs avoid the second memory transfer when interfacing with network cards, as described here. In case the graphic is lost at some point, here is a description of the impact of GPUDirect on getting data from a GPU on one machine transferred across a network interface: with GPUDirect, GPU memory goes to host memory and then straight to the network interface card. Without GPUDirect, GPU memory goes to host memory in one address space, then the CPU has to
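
To make the non-GPUDirect path concrete, here is a hedged sketch of staging a device buffer through pinned host memory with CUDA and MPI; send_from_gpu and the buffer names are hypothetical:

```cpp
// Hypothetical staging path without GPUDirect: device -> pinned host
// buffer -> network stack. With GPUDirect (v1) the CUDA and network
// drivers can share the same pinned region, removing one host-side copy.
#include <cuda_runtime.h>
#include <mpi.h>

void send_from_gpu(const float* dev_buf, int n, int dest) {
    float* host_buf = nullptr;
    cudaMallocHost((void**)&host_buf, n * sizeof(float)); // pinned allocation
    cudaMemcpy(host_buf, dev_buf, n * sizeof(float),
               cudaMemcpyDeviceToHost);                   // GPU -> host memory
    MPI_Send(host_buf, n, MPI_FLOAT, dest, 0, MPI_COMM_WORLD); // host -> NIC
    cudaFreeHost(host_buf);
}
```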

How to prevent two CUDA programs from interfering

穿精又带淫゛_ submitted on 2019-11-30 05:42:34
Question: I've noticed that if two users try to run CUDA programs at the same time, it tends to lock up either the card or the driver (or both?). We need to either reset the card or reboot the machine to restore normal behavior. Is there a way to get a lock on the GPU so other programs can't interfere while it's running? Edit: The OS is Ubuntu 11.10 running on a server. While there is no X Windows running, the card is used to display the text system console. There are multiple users. Answer 1: If you are running
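
On the NVIDIA side the usual answer is the driver's exclusive compute mode, set by an administrator (e.g. nvidia-smi -c EXCLUSIVE_PROCESS on recent drivers; older drivers expose similar modes). A minimal sketch that checks the active mode from a program:

```cpp
// Sketch: query the device's compute mode before doing work. Setting the
// mode itself is an administrative action via nvidia-smi; this program
// only reads the current setting (device 0 assumed).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    switch (prop.computeMode) {
        case cudaComputeModeDefault:
            std::printf("Default: multiple host processes may share the GPU\n");
            break;
        case cudaComputeModeExclusive:
            std::printf("Exclusive: only one context at a time\n");
            break;
        case cudaComputeModeProhibited:
            std::printf("Prohibited: no contexts allowed\n");
            break;
        default:
            std::printf("Other/unknown compute mode\n");
    }
    return 0;
}
```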

CUDA: What is the threads per multiprocessor and threads per block distinction? [duplicate]

∥☆過路亽.° submitted on 2019-11-30 05:27:04
This question already has an answer here: "CUDA: How many concurrent threads in total?" (3 answers). We have a workstation with two NVIDIA Quadro FX 5800 cards installed. Running the deviceQuery CUDA sample reveals that the maximum threads per multiprocessor (SM) is 1024, while the maximum threads per block is 512. Given that only one block can be executed on each SM at a time, why is the max threads per multiprocessor double the max threads per block? How do we utilise the other 512 threads per SM?
Device 1: "Quadro FX 5800"
CUDA Driver Version / Runtime Version    5.0 / 5.0
CUDA Capability Major/Minor version
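
A sketch to make the distinction concrete: a block can hold at most 512 threads on this device, but the SM's scheduler can keep two such blocks resident at once, which is how the other 512 threads per SM get used (the "one block at a time" premise is the misconception). The kernel and sizes below are illustrative:

```cpp
// Illustrative launch: with 512-thread blocks on a device allowing 1024
// resident threads per SM, the hardware can keep two blocks resident on
// each SM simultaneously and interleave their warps to hide latency.
#include <cuda_runtime.h>

__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    const int threadsPerBlock = 512;                 // device maximum here
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n);  // 2 blocks may share an SM
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```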

OpenMP 4.0 in GCC: offload to nVidia GPU

牧云@^-^@ submitted on 2019-11-30 03:44:12
TL;DR: Does GCC (trunk) already support OpenMP 4.0 offloading to an NVIDIA GPU? If so, what am I doing wrong? (Description below.) I'm running Ubuntu 14.04.2 LTS. I have checked out the most recent GCC trunk (dated 25 Mar 2015). I have installed the CUDA 7.0 toolkit according to the Getting Started on Ubuntu guide; the CUDA samples run successfully, i.e. deviceQuery detects my GeForce GT 730. I have followed the instructions from https://gcc.gnu.org/wiki/Offloading as well as https://gcc.gnu.org/install/specific.html#nvptx-x-none and installed nvptx-tools and nvptx-newlib (configure, make, sudo
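
For reference, a minimal OpenMP 4.0 target region one would use as a smoke test, assuming a GCC trunk built with the nvptx offload compiler as in the wiki instructions (compiled with something like g++ -fopenmp -foffload=nvptx-none):

```cpp
// Minimal OpenMP 4.0 offload smoke test (assumes a GCC built with the
// nvptx accelerator toolchain). If offloading is not configured, the
// loop silently falls back to running on the host.
#include <cstdio>

int main() {
    const int n = 1024;
    int a[n];
    #pragma omp target teams distribute parallel for map(from: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2 * i;                     // runs on the GPU if offload works
    std::printf("a[100] = %d\n", a[100]); // expect 200 either way
    return 0;
}
```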

VAO and element array buffer state

杀马特。学长 韩版系。学妹 submitted on 2019-11-30 00:09:37
I was recently writing some OpenGL 3.3 code with Vertex Array Objects (VAO) and tested it later on an Intel graphics adapter, where I found, to my disappointment, that the element array buffer binding is evidently not part of VAO state, as calling:
glBindVertexArray(my_vao);
glDrawElements(GL_TRIANGLE_STRIP, count, GL_UNSIGNED_INT, 0);
had no effect, while:
glBindVertexArray(my_vao);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, my_index_buffer); // ?
glDrawElements(GL_TRIANGLE_STRIP, count, GL_UNSIGNED_INT, 0);
rendered the geometry. I thought it was a mere bug in Intel's implementation of OpenGL
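
For context, the GL 3.3 core specification does make the GL_ELEMENT_ARRAY_BUFFER binding part of VAO state, so a setup like the sketch below (hypothetical data and attribute layout) should let a bare glBindVertexArray plus glDrawElements work; a driver that forgets the binding is non-conformant:

```cpp
// Sketch of VAO setup (GL 3.3 core; context and loader setup omitted).
GLfloat vertices[] = { -0.5f, -0.5f, 0.0f,
                        0.5f, -0.5f, 0.0f,
                       -0.5f,  0.5f, 0.0f,
                        0.5f,  0.5f, 0.0f };
GLuint indices[] = { 0, 1, 2, 3 };

GLuint vao, vbo, ibo;
glGenVertexArrays(1, &vao);
glGenBuffers(1, &vbo);
glGenBuffers(1, &ibo);

glBindVertexArray(vao);                      // start recording VAO state

glBindBuffer(GL_ARRAY_BUFFER, vbo);          // NOT VAO state by itself
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr); // this IS VAO state

glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);  // captured by the bound VAO
glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);

glBindVertexArray(0);                        // unbind the VAO before the IBO

// later, per frame:
glBindVertexArray(vao);
glDrawElements(GL_TRIANGLE_STRIP, 4, GL_UNSIGNED_INT, nullptr);
```

Note the unbind order: unbinding the element array buffer while the VAO is still bound would record the unbind into the VAO.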

Fedora 19 using rpmfussion's NVIDIA driver: libGL error: failed to load driver: swrast

不想你离开。 submitted on 2019-11-29 23:31:27
Question: When running an app that uses Qt 4.7 on my Fedora 19 box, I am getting the following errors from the application:
libGL: screen 0 does not appear to be DRI2 capable
libGL: OpenDriver: trying /usr/lib64/dri/tls/swrast_dri.so
libGL: OpenDriver: trying /usr/lib64/dri/swrast_dri.so
libGL: Can't open configuration file /home/Matthew.Hoggan/.drirc: No such file or directory.
libGL error: failed to load driver: swrast
ERROR: Error failed to create progam.
I do not see these errors in a stock X11

Running more than one CUDA applications on one GPU

核能气质少年 submitted on 2019-11-29 20:43:15
The CUDA documentation does not specify how many CUDA processes can share one GPU. For example, if I launch more than one CUDA program as the same user with only one GPU card installed in the system, what is the effect? Will correctness of execution still be guaranteed? How does the GPU schedule tasks in this case? CUDA activity from independent host processes will normally create independent CUDA contexts, one for each process. Thus, the CUDA activity launched from separate host processes will take place in separate CUDA contexts on the same device. CUDA activity in separate contexts will be serialized
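
A toy program (names and sizes hypothetical) that two shells can run at once to observe this: each process creates its own context on device 0, and without MPS the two contexts' kernels time-slice rather than overlap:

```cpp
// Toy program to launch from two shells simultaneously. Each process gets
// its own CUDA context on device 0; their kernels are serialized between
// contexts, but both programs still complete correctly.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy-wait on the GPU */ }
}

int main() {
    cudaSetDevice(0);            // both processes target the same device
    spin<<<1, 1>>>(1LL << 30);   // long-running kernel from this context
    cudaDeviceSynchronize();
    std::printf("done; this process's context is destroyed at exit\n");
    return 0;
}
```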

Do I have to use the MPS (MULTI-PROCESS SERVICE) when using CUDA6.5 + MPI?

孤者浪人 submitted on 2019-11-29 16:27:54
The linked document, https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf, states (1.1. At a Glance, 1.1.1. MPS): The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities on the latest NVIDIA (Kepler-based) Tesla and Quadro GPUs. Hyper-Q allows CUDA kernels to be processed concurrently on the same GPU; this can benefit performance when the
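
For illustration, a hedged sketch of an MPI program in which every rank shares one GPU. It runs correctly without MPS; starting the MPS control daemon (nvidia-cuda-mps-control -d in CUDA 6.5-era tooling, an assumption worth checking against the linked PDF) only changes whether the ranks' kernels may overlap via Hyper-Q:

```cpp
// Sketch: all MPI ranks use device 0. Without MPS each rank's context is
// time-sliced on the GPU; with MPS their kernels can run concurrently on
// Kepler-class hardware. Correctness is the same either way.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void fill(float* p, float v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);                       // all ranks share device 0
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    fill<<<(n + 255) / 256, 256>>>(d, (float)rank, n);
    cudaDeviceSynchronize();
    std::printf("rank %d done\n", rank);

    cudaFree(d);
    MPI_Finalize();
    return 0;
}
```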

clEnqueueNDRange blocking on Nvidia hardware? (Also Multi-GPU)

好久不见. submitted on 2019-11-29 16:18:38
On NVIDIA GPUs, when I call clEnqueueNDRangeKernel, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, cl::CommandQueue::enqueueNDRangeKernel, but this shouldn't make a difference. This only happens on NVIDIA hardware (3 Tesla M2090s) used remotely; on our office workstations with AMD GPUs, the call is non-blocking and returns immediately. I don't have local NVIDIA hardware to test on; we used to, and I remember similar behavior then too, but it's a bit hazy. This makes spreading the work across multiple GPUs harder. I've tried starting a new thread for
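
A common pattern with the C++ bindings is to enqueue on every device's queue first, then flush all queues so the work is actually submitted before anything blocks. A sketch under the assumption that per-device queues and the kernel are already set up (names are placeholders):

```cpp
// Sketch of the usual multi-GPU pattern: enqueue on every device's queue,
// flush them all so the commands are submitted without blocking, and only
// then wait. Queue/kernel construction is omitted.
#include <CL/cl.hpp>
#include <vector>

void run_on_all(std::vector<cl::CommandQueue>& queues,
                cl::Kernel& kernel, size_t global_size) {
    for (auto& q : queues)
        q.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange(global_size), cl::NullRange);
    for (auto& q : queues)
        q.flush();    // push commands to each device without blocking
    for (auto& q : queues)
        q.finish();   // now wait for completion on all devices
}
```

If the implementation really blocks inside the enqueue call itself, this pattern will not help, and driving each device from its own host thread is the usual fallback.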