nvidia

CUDA compiler not working with GCC 4.5+

↘锁芯ラ submitted on 2019-11-29 15:37:38
I am new to CUDA, and I am trying to compile this simple test_1.cu file:

    #include <stdio.h>
    __global__ void kernel(void) { }
    int main(void) {
        kernel<<<1,1>>>();
        printf("Hello, World!\n");
        return 0;
    }

using this: nvcc test_1.cu

The output I get is:

    In file included from /usr/local/cuda/bin/../include/cuda_runtime.h:59:0, from <command-line>:0:
    /usr/local/cuda/bin/../include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.5 and up are not supported!

My gcc --version:

    gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1
    Copyright (C) 2011 Free Software Foundation, Inc. This is free
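For reference, below is a cleaned-up sketch of the same test file with basic error checking added; the nvcc option noted in the comment (-ccbin, a.k.a. --compiler-bindir) is the commonly suggested way to point nvcc at an older, supported host gcc, and the gcc-4.4 path is only illustrative.

```cuda
// test_1.cu -- minimal CUDA "hello world" used to check the toolchain.
// If nvcc rejects the installed gcc ("gcc 4.5 and up are not supported!"),
// a common workaround is to point nvcc at an older host compiler, e.g.:
//   nvcc --compiler-bindir /usr/bin/gcc-4.4 test_1.cu   (path is illustrative)
#include <cstdio>

__global__ void kernel(void)
{
    // Empty kernel: we only care that compilation and the launch succeed.
}

int main(void)
{
    kernel<<<1, 1>>>();                            // launch 1 block of 1 thread
    cudaError_t err = cudaDeviceSynchronize();     // surface any launch error
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Hello, World!\n");
    return 0;
}
```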

Cannot compile OpenCL application using 1.2 headers in 1.1 version

主宰稳场 submitted on 2019-11-29 14:47:28
Question: I'm writing a small hello-world OpenCL program using the Khronos Group's cl.hpp for OpenCL 1.2 and NVIDIA's OpenCL libraries. The drivers and ICD I have support OpenCL 1.1. Since the NVIDIA side doesn't support 1.2 yet, I get some errors about functions required by OpenCL 1.2. On the other hand, cl.hpp for OpenCL 1.2 has a flag, CL_VERSION_1_1 to be exact, to run the header in 1.1 mode, but it's not working. Does anybody have a similar experience or a solution? Note: cl.hpp for version 1.1 works but,
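As an illustration (not the asker's code), one way to stay within what a 1.1 driver supports is to call only the OpenCL 1.1 C API from the host; defining CL_USE_DEPRECATED_OPENCL_1_1_APIS before including the headers is the switch the 1.2 headers provide to keep the 1.1-era entry points usable. A minimal sketch:

```cuda
// Query the platform's reported OpenCL version using only 1.1-era C API calls.
// CL_USE_DEPRECATED_OPENCL_1_1_APIS keeps 1.1 entry points available when
// building against OpenCL 1.2 headers.
#define CL_USE_DEPRECATED_OPENCL_1_1_APIS
#include <CL/cl.h>
#include <cstdio>

int main()
{
    cl_platform_id platform = NULL;
    cl_uint num_platforms = 0;
    if (clGetPlatformIDs(1, &platform, &num_platforms) != CL_SUCCESS ||
        num_platforms == 0) {
        std::fprintf(stderr, "No OpenCL platform found\n");
        return 1;
    }

    char version[128] = {0};
    clGetPlatformInfo(platform, CL_PLATFORM_VERSION, sizeof(version), version, NULL);
    std::printf("Platform reports: %s\n", version);  // e.g. "OpenCL 1.1 CUDA ..."
    return 0;
}
```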

CL_OUT_OF_RESOURCES for 2 million floats with 1 GB VRAM?

ぐ巨炮叔叔 submitted on 2019-11-29 14:36:49
It seems like 2 million floats should be no big deal: only 8 MB out of 1 GB of GPU RAM. I am able to allocate that much at times, and sometimes more than that, with no trouble. I get CL_OUT_OF_RESOURCES when I do a clEnqueueReadBuffer, which seems odd. Am I able to sniff out where the trouble really started? OpenCL shouldn't be failing like this at clEnqueueReadBuffer, right? It should fail when I allocated the data, right? Is there some way to get more details than just the error code? It would be cool if I could see how much VRAM was allocated when OpenCL declared CL_OUT_OF_RESOURCES. Eric Towers: Not
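One way to narrow down where things actually went wrong is to check the return code of every preceding call rather than only the read; because kernel launches are asynchronous, an out-of-resources failure often only surfaces at the next blocking call such as clEnqueueReadBuffer or clFinish. A rough sketch (the queue, kernel, and buffer names are placeholders, not the asker's code):

```cuda
// Sketch: check every OpenCL call so a failure is reported where it happens,
// not just at the blocking read. 'queue', 'kernel' and 'buf' are placeholders.
#include <CL/cl.h>
#include <cstdio>
#include <cstdlib>

static void check(cl_int err, const char *what)
{
    if (err != CL_SUCCESS) {
        std::fprintf(stderr, "%s failed with error %d\n", what, err);
        std::exit(1);
    }
}

void run(cl_command_queue queue, cl_kernel kernel, cl_mem buf,
         size_t n, float *host_out)
{
    size_t global = n;
    check(clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                                 0, NULL, NULL),
          "clEnqueueNDRangeKernel");

    // clFinish forces the kernel to complete here; if the kernel itself ran
    // out of resources, the error shows up at this point rather than at the read.
    check(clFinish(queue), "clFinish");

    check(clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                              host_out, 0, NULL, NULL),
          "clEnqueueReadBuffer");
}
```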

CUDA global (as in C) dynamic arrays allocated to device memory

删除回忆录丶 submitted on 2019-11-29 14:32:36
Question: So, I'm trying to write some code that uses NVIDIA's CUDA architecture. I noticed that copying to and from the device was really hurting my overall performance, so now I am trying to move a large amount of data onto the device. As this data is used in numerous functions, I would like it to be global. Yes, I can pass pointers around, but I would really like to know how to work with globals in this instance. So, I have device functions that want to access a device-allocated array. Ideally, I
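One common pattern for this (a sketch under the assumption that everything lives in one compilation unit, not the asker's actual code) is to declare a __device__ pointer at file scope, allocate the array with cudaMalloc on the host, and publish the pointer through the symbol with cudaMemcpyToSymbol; kernels and device functions can then use it without it being passed as a parameter:

```cuda
#include <cstdio>

// File-scope device pointer, visible to every kernel and __device__ function
// in this compilation unit.
__device__ float *g_data = NULL;

__global__ void scale(float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        g_data[i] *= factor;          // no pointer argument needed
}

int main()
{
    const int n = 1 << 20;
    float *d_ptr = NULL;
    cudaMalloc(&d_ptr, n * sizeof(float));
    cudaMemset(d_ptr, 0, n * sizeof(float));

    // Publish the allocation through the global device symbol.
    cudaMemcpyToSymbol(g_data, &d_ptr, sizeof(d_ptr));

    scale<<<(n + 255) / 256, 256>>>(2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_ptr);
    return 0;
}
```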

CUDA fails when trying to use both the onboard iGPU and an NVIDIA discrete card. How can I use both the discrete NVIDIA and the integrated (onboard) Intel GPU? [closed]

醉酒当歌 submitted on 2019-11-29 13:06:25
I recently had some trouble making my PC (Ivy Bridge) use the onboard GPU (Intel iGPU HD 4000) for normal screen display while I run my CUDA programs for computation on the discrete NVIDIA GT 640 I have in my machine. The problem was that under iGPU display, CUDA would be unable to spot the NVIDIA card, and the NVIDIA drivers would not load at all. Keep in mind that there are confirmed issues (mostly about concurrency) when using the NVIDIA Windows drivers for a device that both drives the display and is used for CUDA. Those issues can be avoided when you use the Intel GPU as the display (thus loading
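When diagnosing a setup like this, a small device-enumeration check makes it easy to see whether CUDA can spot the NVIDIA card at all once the Intel GPU is driving the display. A minimal sketch using only the CUDA runtime API:

```cuda
// List every CUDA-capable device the runtime can see. If the NVIDIA driver did
// not load (e.g. the iGPU is driving the display and the discrete card is not
// initialized), cudaGetDeviceCount typically fails or reports zero devices.
#include <cstdio>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("CUDA devices visible: %d\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("  [%d] %s (compute %d.%d)\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```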

TensorFlow in nvidia-docker: failed call to cuInit: CUDA_ERROR_UNKNOWN

南楼画角 submitted on 2019-11-29 10:57:14
I have been working on getting an application that relies on TensorFlow to run as a Docker container with nvidia-docker. I have built my application on top of the tensorflow/tensorflow:latest-gpu-py3 image. I run my Docker container with the following command: sudo nvidia-docker run -d -p 9090:9090 -v /src/weights:/weights myname/myrepo:mylabel. When looking at the logs through Portainer, I see the following: 2017-05-16 03:41:47.715682: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your
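Since the error complains that cuInit itself fails, a tiny driver-API probe run inside the container can help separate "the container cannot reach the NVIDIA driver" from "TensorFlow is misconfigured". A sketch, assuming the CUDA toolkit and driver library are available in the image:

```cuda
// Probe the CUDA driver API directly. If cuInit fails here too, the container
// is not getting access to the host's NVIDIA driver/devices (an nvidia-docker
// runtime/device mapping issue), independent of TensorFlow.
// Example build inside the container: nvcc probe.cu -o probe -lcuda
#include <cuda.h>
#include <cstdio>

int main()
{
    CUresult res = cuInit(0);
    if (res != CUDA_SUCCESS) {
        std::printf("cuInit failed with CUresult %d\n", (int)res);
        return 1;
    }
    int count = 0;
    cuDeviceGetCount(&count);
    std::printf("Driver initialized, %d device(s) visible\n", count);
    return 0;
}
```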

How to control stereo-frames separately with C#? (NVIDIA 3D shutter glasses)

守給你的承諾、 submitted on 2019-11-29 09:40:28
Question: I'm trying to make a very simple application which would display a different image to each eye. I have an Asus VG236H monitor and the NVIDIA 3D Vision kit, i.e. the stereo 3D shutter glasses. I'm using C#, .NET Framework 2.0, DirectX 9 (Managed DirectX) and Visual Studio 2008. I have been searching high and low for examples and tutorials, have actually found a couple, and based on those I have created the program, but for some reason I can't get it working. When looking for examples of how to display

Does AMD's OpenCL offer something similar to CUDA's GPUDirect?

 ̄綄美尐妖づ submitted on 2019-11-29 08:17:31
Question: NVIDIA offers GPUDirect to reduce memory transfer overheads. I'm wondering if there is a similar concept for AMD/ATI. Specifically: 1) Do AMD GPUs avoid the second memory transfer when interfacing with network cards, as described here? In case the graphic is lost at some point, here is a description of the impact of GPUDirect on getting data from a GPU on one machine transferred across a network interface: with GPUDirect, GPU memory goes to host memory and then straight to the network

NV_STEREO_IMAGE_SIGNATURE and DirectX 10/11 (nVidia 3D Vision)

主宰稳场 submitted on 2019-11-29 08:08:41
I'm trying to use SlimDX and DirectX 10 or 11 to control the stereoization process on the NVIDIA 3D Vision Kit. Thanks to this question I've been able to make it work in DirectX 9. However, due to some missing methods I've been unable to make it work under DirectX 10 or 11. The algorithm goes like this: render the left-eye image, render the right-eye image, create a texture able to contain them both plus an extra row (so the texture size would be 2 * width, height + 1), write the NV_STEREO_IMAGE_SIGNATURE value, and render this texture on the screen. My test code skips the first two steps, as I already have a
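For context on the signature step, the extra row holds a small header; the sketch below follows the layout from NVIDIA's widely circulated nvstereo.h sample header (reproduced here from memory as an illustration, so verify field names and flag values against the actual header before relying on them):

```cuda
// Layout of the stereo signature written into the extra row of the packed
// left/right texture (2*width x height+1). Follows the nvstereo.h sample header;
// reproduced for illustration only.
#define NVSTEREO_IMAGE_SIGNATURE 0x4433564e   // spells "NV3D" byte-wise

typedef struct _Nv_Stereo_Image_Header
{
    unsigned int dwSignature;  // must be NVSTEREO_IMAGE_SIGNATURE
    unsigned int dwWidth;      // width of the packed image (2 * eye width)
    unsigned int dwHeight;     // height of one eye image
    unsigned int dwBPP;        // bits per pixel, e.g. 32
    unsigned int dwFlags;      // e.g. swap-eyes / scale-to-fit flags
} NVSTEREOIMAGEHEADER, *LPNVSTEREOIMAGEHEADER;
```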

L2 cache in Kepler

狂风中的少年 submitted on 2019-11-29 08:06:07
Question: How does the L2 cache work in GPUs with the Kepler architecture in terms of locality of reference? For example, if a thread accesses an address in global memory and the value at that address is not in the L2 cache, how is the value cached? Is it temporal? Or are other values near that address brought into the L2 cache too (spatial)? The picture below is from an NVIDIA whitepaper. Answer 1: A unified L2 cache was introduced with compute capability 2.0 and higher and continues to be supported on the Kepler