nvidia

CUDA: Why is Thrust so slow at uploading data to the GPU?

梦想与她 submitted on 2021-02-08 09:33:32
Question: I'm new to the GPU world and just installed CUDA to write some programs. I played with the Thrust library but found that it is very slow when uploading data to the GPU: only about 35 MB/s host-to-device on my reasonably good desktop. Why is that? Environment: Visual Studio 2012, CUDA 5.0, GTX 760, Intel i7, Windows 7 x64. GPU bandwidth test: host-to-device (and device-to-host) transfers are supposed to reach at least 11 GB/s, but they don't! Here's the test program: #include <iostream> #include
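
A common culprit in numbers this low is copying from pageable host memory (which thrust::host_vector allocates) or timing that includes one-time setup. Below is a minimal sketch, not the asker's program (sizes and names are illustrative), comparing pageable and page-locked host-to-device bandwidth with CUDA events:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Measures one host-to-device cudaMemcpy of `bytes` from `src`, in GB/s.
static float h2dGBs(float* dst, const float* src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / 1e9f) / (ms / 1e3f);
}

int main() {
    const size_t bytes = 64 << 20;  // 64 MiB payload
    float* pageable = (float*)malloc(bytes);
    float* pinned = nullptr;
    float* dev = nullptr;
    cudaMallocHost((void**)&pinned, bytes);  // page-locked (pinned) host memory
    cudaMalloc((void**)&dev, bytes);

    h2dGBs(dev, pageable, bytes);  // warm-up; the first transfer can be slower
    printf("pageable H2D: %.2f GB/s\n", h2dGBs(dev, pageable, bytes));
    printf("pinned   H2D: %.2f GB/s\n", h2dGBs(dev, pinned, bytes));

    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}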

The behavior of stream 0 (default) and other streams

旧巷老猫 submitted on 2021-02-08 09:15:42
Question: In CUDA, how is stream 0 related to other streams? Does stream 0 (the default stream) execute concurrently with other streams in a context or not? Consider the following example:

cudaMemcpy(Dst, Src, sizeof(float)*datasize, cudaMemcpyHostToDevice); // stream 0
cudaStream_t stream1;
/* ...creating stream1... */
somekernel<<<blocks, threads, 0, stream1>>>(Dst); // stream 1

In the above code, can the compiler ensure that somekernel always launches AFTER cudaMemcpy finishes, or will somekernel execute
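
For context: with legacy default-stream semantics, the synchronous cudaMemcpy blocks the host until the copy completes, so the kernel above cannot launch earlier; the default stream also does not run concurrently with other streams in the same context. A minimal sketch of the explicit alternative, reusing the question's names (the launch wrapper and its parameters are hypothetical): issue both operations into the same non-default stream, where issue order guarantees ordering.

#include <cuda_runtime.h>

// Stub standing in for the asker's kernel.
__global__ void somekernel(float* p) { /* ... */ }

void launch(float* Dst, const float* Src, size_t datasize, int blocks, int threads) {
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);
    // Operations within a single stream run in issue order, so the kernel
    // is guaranteed to see the completed copy. For the copy to be truly
    // asynchronous, Src should be page-locked (cudaMallocHost) memory.
    cudaMemcpyAsync(Dst, Src, sizeof(float) * datasize,
                    cudaMemcpyHostToDevice, stream1);
    somekernel<<<blocks, threads, 0, stream1>>>(Dst);
    cudaStreamSynchronize(stream1);
    cudaStreamDestroy(stream1);
}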

Offline compilation for AMD and NVIDIA OpenCL Kernels without cards installed

我的未来我决定 submitted on 2021-02-08 09:03:52
Question: I was trying to figure out a way to compile OpenCL kernels offline, without graphics cards installed. I have installed the SDKs. Does anyone have any experience compiling OpenCL kernels without having the graphics cards installed, for either NVIDIA or AMD? I asked a similar question on the AMD forums (http://devgurus.amd.com/message/1284379). The NVIDIA forums have long been inaccessible, so I couldn't get any help there. Thanks Answer 1: AMD has an OpenCL extension
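
The extension referred to is presumably cl_amd_offline_devices, which AMD's APP SDK exposes through a context property. A minimal sketch under that assumption (the kernel source is a placeholder and error checking is omitted):

#include <stdio.h>
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* defines CL_CONTEXT_OFFLINE_DEVICES_AMD */

int main(void) {
    const char* src =
        "__kernel void k(__global float* a) { a[get_global_id(0)] += 1.0f; }";

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    /* Ask AMD's runtime to expose every device it knows how to compile
       for, even ones not physically present. */
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        CL_CONTEXT_OFFLINE_DEVICES_AMD, (cl_context_properties)1,
        0
    };
    cl_int err;
    cl_context ctx = clCreateContextFromType(props, CL_DEVICE_TYPE_ALL,
                                             NULL, NULL, &err);

    cl_uint n = 0;
    clGetContextInfo(ctx, CL_CONTEXT_NUM_DEVICES, sizeof(n), &n, NULL);
    printf("devices visible (including offline): %u\n", n);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 0, NULL, NULL, NULL, NULL);  /* builds for all context devices */
    /* Device binaries can then be retrieved via
       clGetProgramInfo(prog, CL_PROGRAM_BINARIES, ...). */
    return 0;
}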

Access GPU hardware specifications in Python?

随声附和 submitted on 2021-02-08 08:31:35
Question: I want to access various NVIDIA GPU specifications using Numba or a similar Python CUDA package: information such as available device memory, L2 cache size, memory clock frequency, etc. From reading this question, I learned I can access some of the information (but not all) through Numba's CUDA device interface:

from numba import cuda

device = cuda.get_current_device()
attribs = [s for s in dir(device) if s.isupper()]
for attr in attribs:
    print(attr, '=', getattr(device, attr))

Output on a
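
For the runtime figures Numba's attribute list does not cover, such as free/total device memory and current clocks, NVML is one option. A minimal sketch, assuming the nvidia-ml-py (pynvml) package is installed:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

# Free/total device memory, in bytes.
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print('total memory:', mem.total)
print('free memory :', mem.free)

# Current memory clock, in MHz.
print('memory clock:',
      pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM), 'MHz')

pynvml.nvmlShutdown()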

How to convert Tensorflow 2.0 SavedModel to TensorRT?

心已入冬 submitted on 2021-02-07 17:30:23
Question: I've trained a model in TensorFlow 2.0 and am trying to improve prediction time when moving to production (on a server with GPU support). In TensorFlow 1.x I was able to get a prediction speedup by using freeze_graph, but this has been deprecated as of TensorFlow 2. From reading Nvidia's description of TensorRT, they suggest that TensorRT can speed up inference by 7x compared to TensorFlow alone. Source: TensorFlow 2.0 with Tighter TensorRT Integration Now Available. I have trained my model and
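
For reference, the TF-TRT path in TensorFlow 2.x converts a SavedModel directly. A minimal sketch, assuming a TensorFlow build with TensorRT support; the directory names are placeholders:

from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='saved_model')  # the TF 2.0 SavedModel to convert
converter.convert()                       # rewrites eligible subgraphs as TRT ops
converter.save('saved_model_trt')         # writes the converted SavedModel

# The result loads like any other SavedModel:
# model = tf.saved_model.load('saved_model_trt')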

Cannot run JavaFX app on docker for more than a few minutes

。_饼干妹妹 submitted on 2021-02-07 12:19:43
Question: I developed an application used as a communication service for a separate web app. I had zero issues "dockerizing" the web app, but the service is proving to be a nightmare. It is based on JavaFX, and there is a property the user can set in the config file so that the app does not initialize any windows, menus, containers, etc. This "headless" mode (not sure that is truly headless...) effectively turns the service app into a background service. Let me also preface this by
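
On the "not sure that is truly headless" point: skipping window creation still initializes the JavaFX toolkit, which normally expects a display. One way to run JavaFX genuinely headless in a container is the Monocle glass platform; a minimal sketch, assuming Monocle is available (bundled in some JDK builds, otherwise via the org.openjfx:openjfx-monocle artifact) and with ServiceApp as a hypothetical application class:

// Force the headless pipeline before the JavaFX toolkit initializes.
public final class HeadlessBootstrap {
    public static void main(String[] args) {
        System.setProperty("glass.platform", "Monocle");
        System.setProperty("monocle.platform", "Headless");
        System.setProperty("prism.order", "sw");  // software rendering: no GPU, no X11
        // Then launch the real application, e.g.:
        // javafx.application.Application.launch(ServiceApp.class, args);
    }
}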

Programmatically selecting integrated graphics in nVidia Optimus

和自甴很熟 submitted on 2021-02-07 11:21:31
Question: There are many questions and answers about how to select the NVIDIA discrete adapter at runtime on the Windows platform. The easiest way is to export an NvOptimusEnablement variable like this:

extern "C" __declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;

I have the opposite requirement: I need to select the integrated graphics at runtime for my application, no matter what the preferred graphics processor in the NVIDIA Control Panel is. This variable is not suitable for that. How can I do it? Answer 1:
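
One Direct3D-specific option (an assumption about the rendering API; it does not help for OpenGL) is to sidestep Optimus selection entirely: enumerate adapters with DXGI and create the device on the integrated one, identified below by Intel's PCI vendor ID 0x8086. A minimal sketch:

#include <dxgi.h>
#include <d3d11.h>
#pragma comment(lib, "dxgi.lib")
#pragma comment(lib, "d3d11.lib")

ID3D11Device* CreateDeviceOnIntegratedGpu() {
    IDXGIFactory* factory = nullptr;
    CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory);

    // Walk the adapter list and stop at the Intel integrated adapter.
    IDXGIAdapter* adapter = nullptr;
    for (UINT i = 0; factory->EnumAdapters(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i) {
        DXGI_ADAPTER_DESC desc;
        adapter->GetDesc(&desc);
        if (desc.VendorId == 0x8086) break;  // Intel's PCI vendor ID
        adapter->Release();
        adapter = nullptr;
    }

    ID3D11Device* device = nullptr;
    if (adapter) {
        // An explicitly chosen adapter requires D3D_DRIVER_TYPE_UNKNOWN.
        D3D11CreateDevice(adapter, D3D_DRIVER_TYPE_UNKNOWN, nullptr, 0,
                          nullptr, 0, D3D11_SDK_VERSION, &device, nullptr, nullptr);
        adapter->Release();
    }
    factory->Release();
    return device;
}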

TensorFlow 1.14 performance issue on RTX 3090

独自空忆成欢 submitted on 2021-02-07 07:46:14
Question: I am running a model written with TensorFlow 1.x on 4x RTX 3090, and it takes much longer to start training than on 1x RTX 3090. Once training starts, though, it finishes earlier on 4x than on 1x. I am using CUDA 11.1 and TensorFlow 1.14 in both setups. Secondly, when I use 1x RTX 2080 Ti with CUDA 10.2 and TensorFlow 1.14, it takes less time to start training than 1x RTX 3090 with CUDA 11.1 and TensorFlow 1.14. Roughly, it takes 5 min
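
A plausible explanation, offered as an assumption since no answer is shown here: TensorFlow 1.14 predates Ampere, so its binaries ship no prebuilt machine code for the RTX 3090's compute capability 8.6, and the CUDA driver JIT-compiles PTX at startup; TensorFlow's large kernel count can also overflow the small default JIT cache, forcing recompilation on every run. If that is the cause, enlarging and persisting the cache helps after the first run:

# Assumed mitigation: grow and persist the CUDA JIT cache so kernels are
# compiled once rather than on every startup. CUDA_CACHE_MAXSIZE and
# CUDA_CACHE_PATH are documented NVIDIA driver environment variables and
# must be set before TensorFlow initializes CUDA.
import os

os.environ['CUDA_CACHE_MAXSIZE'] = str(4 * 1024 ** 3)  # allow up to 4 GiB
os.environ['CUDA_CACHE_PATH'] = os.path.expanduser('~/.nv/ComputeCache')

import tensorflow as tf  # import only after the cache variables are set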