nvidia

How many memory latency cycles per memory access type in OpenCL/CUDA?

孤者浪人 submitted on 2019-12-09 12:57:45
Question: I looked through the programming guide and the best practices guide, and they mention that global memory access takes 400-600 cycles, but say little about the other memory types, such as the texture cache, constant cache, and shared memory. Registers have zero memory latency. Is the constant cache as fast as registers when all threads read the same address in it? I am not so sure about the worst case. Is shared memory as fast as registers so long as there are no bank conflicts? If there are, then how does …
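The bank-conflict case the question raises can be illustrated with a small sketch (a hypothetical kernel, not from the question; shared memory has 16 banks on compute capability 1.x and 32 banks on 2.x and later):

```cuda
// Sketch: shared-memory bank conflicts. Shared memory is divided into banks of
// 4-byte words; when the threads of a warp hit distinct banks the access runs
// at full speed, but N threads hitting the same bank serialize into N accesses.
__global__ void bankDemo(float *out) {
    __shared__ float tile[32 * 32];
    int t = threadIdx.x;

    // Conflict-free: consecutive threads read consecutive words (distinct banks).
    float a = tile[t];

    // Heavily conflicted: with a stride of 32 words, threads 0..31 of a warp
    // all map to the same bank, so the reads serialize.
    float b = tile[t * 32];

    out[t] = a + b;
}
```

Broadcast is the exception: if every thread of a warp reads the *same* shared-memory word, the hardware broadcasts it without a conflict, which mirrors the constant-cache behavior the question describes.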

How is a CUDA kernel launched?

懵懂的女人 submitted on 2019-12-09 10:29:02
Question: I have created a simple CUDA application to add two matrices. It compiles fine. I want to know how the kernel is launched across all the threads, and what the flow is inside CUDA: in what fashion does each thread process an element of the matrices? I know this is a very basic concept, but I don't understand it and am confused about the flow. Answer 1: You launch a grid of blocks. Blocks are indivisibly assigned to multiprocessors (where the number of blocks on the multiprocessor …
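The flow the question asks about can be sketched as follows (kernel and variable names are assumed, not taken from the asker's code): every thread derives a unique (row, column) pair from its block and thread indices and handles exactly one matrix element.

```cuda
// Each thread computes one element of C = A + B. The grid is a 2-D array of
// blocks and each block a 2-D array of threads; blockIdx/threadIdx together
// give every thread its own coordinates in the matrix.
__global__ void matAdd(const float *A, const float *B, float *C, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column this thread owns
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row this thread owns
    if (x < W && y < H)                             // guard the ragged edge
        C[y * W + x] = A[y * W + x] + B[y * W + x];
}

// Host side: launch enough 16x16 blocks to cover the whole W x H matrix.
// dim3 block(16, 16);
// dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
// matAdd<<<grid, block>>>(dA, dB, dC, W, H);
```

There is no guaranteed execution order between blocks; the scheduler distributes them across multiprocessors as resources free up, which is why each thread must compute its own indices rather than rely on any ordering.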

How to debug OpenCL on Nvidia GPUs?

荒凉一梦 submitted on 2019-12-09 09:51:46
Question: Is there any way to debug OpenCL kernels on an Nvidia GPU, i.e. set breakpoints and inspect variables? My understanding is that Nvidia's tool does not allow OpenCL debugging, and AMD's and Intel's tools only work on their own devices. Answer 1: gDEBugger might help you somewhat (I have never used it, though), but other than that there isn't any tool I know of that can set breakpoints or inspect variables inside a kernel. Perhaps try to save intermediate outputs from your kernel if it is a long kernel. …
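The "save intermediate outputs" workaround in the answer usually means passing the kernel an extra scratch buffer and reading it back on the host, a sort of printf-by-buffer. A sketch (kernel and buffer names are made up for illustration):

```c
// OpenCL C kernel sketch: dump an intermediate value into a debug buffer,
// one slot per work-item, then read it back on the host with
// clEnqueueReadBuffer and inspect it like a log.
__kernel void work(__global const float *in,
                   __global float *out,
                   __global float *debug)   // hypothetical scratch buffer
{
    size_t i = get_global_id(0);
    float partial = in[i] * 2.0f;   // some intermediate value of interest
    debug[i] = partial;             // record it for host-side inspection
    out[i] = partial + 1.0f;
}
```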

How do I see the commands that are run by GNU make?

↘锁芯ラ submitted on 2019-12-08 17:08:12
Question: I'm trying to debug a complex Makefile. How do you get GNU make to print all the commands it runs? I couldn't find the answer in the man page (using the -d flag doesn't seem to print it). (This isn't necessary information to answer my question, but in case you're wondering: I'm having trouble compiling a project built on NVIDIA's CUDA library. I can compile it myself, but using their Makefile results in a nasty compiler error. I'd like to use their provided Makefile for easier packaging, and …
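A few GNU make switches cover this; a quick sketch with a throwaway Makefile (the file path and recipe below are just examples):

```shell
# A dry run (-n / --just-print) prints every recipe line make would execute,
# without running it, including lines silenced with a leading @.
printf 'all:\n\t@echo nvcc -o demo demo.cu\n' > /tmp/demo.mk
make -n -f /tmp/demo.mk

# Other useful switches:
#   make --trace ...   (GNU make >= 4.0) runs the build and prints each recipe
#                      line with the Makefile location it came from
#   make -p            dumps make's internal database of rules and variables
#   make V=1           only works if the Makefile itself honors a V/VERBOSE
#                      convention, as many large projects do
```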

Entry function uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max) - CUDA error

旧街凉风 submitted on 2019-12-08 12:18:47
Question: I am using a Tesla C2050, which has compute capability 2.0 and 48 KB of shared memory. But when I try to use this shared memory, the nvcc compiler gives me the following error: Entry function '_Z4SAT3PhPdii' uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max). My SAT1 is the naive implementation of the scan algorithm, and because I am operating on images of size 4096x2160 I have to use double to compute the cumulative sum. Though the Tesla C2050 does not support double …

OpenCV - Copy GpuMat into cuda device data

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-08 10:07:05
Question: I am trying to copy the data in a cv::cuda::GpuMat to a uint8_t* variable that is to be used in a kernel. The GpuMat contains image data of resolution 752x480 and type CV_8UC1. Below is the sample code: uint8_t *imgPtr; cv::Mat left, downloadedLeft; cv::cuda::GpuMat gpuLeft; left = imread("leftview.jpg", cv::IMREAD_GRAYSCALE); gpuLeft.upload(left); cudaMalloc((void **)&imgPtr, sizeof(uint8_t)*gpuLeft.rows*gpuLeft.cols); cudaMemcpyAsync(imgPtr, gpuLeft.ptr<uint8_t>(), sizeof(uint8_t) …
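The usual pitfall with code like this is that GpuMat rows are padded for alignment, so `step` (the row stride in bytes) is generally larger than `cols`; a flat copy of `rows * cols` bytes would interleave pixel data with padding. `cudaMemcpy2D` handles the pitch. A sketch reusing the question's variable names:

```cuda
// Pitched device-to-device copy out of a GpuMat into a tightly packed buffer.
// GpuMat::step is the padded row stride in bytes; cudaMemcpy2D copies
// cols bytes from each of rows rows, skipping the padding.
uint8_t *imgPtr;
cudaMalloc((void **)&imgPtr, sizeof(uint8_t) * gpuLeft.rows * gpuLeft.cols);
cudaMemcpy2D(imgPtr,                          // dst
             gpuLeft.cols * sizeof(uint8_t),  // dst pitch: tightly packed
             gpuLeft.ptr<uint8_t>(),          // src
             gpuLeft.step,                    // src pitch: padded row stride
             gpuLeft.cols * sizeof(uint8_t),  // width of each row in bytes
             gpuLeft.rows,                    // number of rows
             cudaMemcpyDeviceToDevice);
```

Alternatively, many kernels can consume the GpuMat's own storage directly by taking `gpuLeft.ptr<uint8_t>()` plus `gpuLeft.step` as arguments, avoiding the copy entirely.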

Inter-block synchronization in CUDA

心已入冬 submitted on 2019-12-08 09:36:49
Question: I've spent a month on this problem: I cannot synchronize blocks in CUDA. I've read a lot of posts about atomicAdd, cooperative groups, etc. I decided to use a global array so that each block could write to one element of it. After this write, a thread of the block waits (i.e. spins in a while loop) until all blocks have written to the global array. When I used 3 blocks my synchronization worked well (because I have 3 SMs). But using 3 blocks gives me 12% occupancy, so I need to use more blocks, but …
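The spin-on-a-global-array scheme deadlocks precisely when there are more blocks than can be resident at once, which is why it breaks as soon as the block count grows. The supported alternative is a cooperative-groups grid sync (compute capability 6.0+ and a launch via `cudaLaunchCooperativeKernel`, so it may not apply to the asker's GPU); a sketch with assumed names:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Grid-wide barrier via cooperative groups. Unlike a hand-rolled spin on a
// global array, this is safe: the cooperative launch API refuses to launch
// more blocks than can be co-resident, so no block can be starved at the
// barrier waiting for one that was never scheduled.
__global__ void twoPhase(float *data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) data[i] *= 2.0f;   // phase 1
    grid.sync();                  // all blocks reach here before any continues
    if (i < n) data[i] += 1.0f;   // phase 2 sees every phase-1 write
}
```

On hardware without cooperative launch, the robust fallback is to split the work into two kernels and let the kernel boundary act as the grid-wide barrier.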

GeForce Experience share feature generates whitelist errors and slows performance

白昼怎懂夜的黑 submitted on 2019-12-08 07:56:36
Question: I'm developing an application that previews video feeds from a capture card and/or a webcam. I've noticed a lot of errors in my console that look like: IGIESW [path to my.exe] found in whitelist: NO IGIWHW Game [path to my.exe] found in whitelist: NO These repeat each time I try to activate a preview window or switch the source feed I'm trying to preview. Each occurrence actually takes a few seconds and it really kills the responsiveness of my application. I'm also seeing a similar …

Compute Prof's fields for incoherent and coherent gst/gld? (CUDA/OpenCL)

心已入冬 submitted on 2019-12-08 07:31:19
Question: I am using Compute Prof 3.2 and a GeForce GTX 280, so I believe I have compute capability 1.3. This file seems to show that I should be able to see these fields, since I am using a 1.x compute device. Well, I don't see them, and the User Guide for the 3.2 toolkit says I can't see them, but calls them gst_uncoalesced and gst_coalesced. To sum up, I am confused about how to figure out from the profiler whether I am making non-coalesced reads from global memory. It doesn't look like Fermi cards will say either, but I am not worried about them for now. If anybody can elaborate on the situation, I …

ubuntu 16.04 LTS login loop after updating driver nvidia-396

穿精又带淫゛_ submitted on 2019-12-08 07:02:08
Question: I have an issue logging in to my computer when nvidia-396 is installed: it returns to the login screen after showing an error message pop-up. When I remove nvidia* and restart lightdm, it works fine. Could you please help me fix this? Thanks. Answer 1: I had the same issue with this driver. My system is: Nvidia GTX 1060 (6 GB), AMD FX 8350, ASUS motherboard. I was using the 390 driver (394.48), then upgraded to 396 and got this 'lightdm<->nvidia driver' problem. It seems that mostly users are getting …