nvidia

How many memory latency cycles per memory access type in OpenCL/CUDA?

孤者浪人 submitted on 2019-12-09 12:57:45
Question: I looked through the programming guide and the best practices guide, and they mention that global memory access takes 400-600 cycles, but say little about the other memory types, such as the texture cache, constant cache, and shared memory. Registers have zero memory latency. Is the constant cache as fast as registers when all threads read the same address in it? I am not so sure about the worst case. Is shared memory as fast as registers so long as there are no bank conflicts? If there are, then how does …
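The bank-conflict case the question raises can be illustrated with a small sketch (a hypothetical kernel, not from the question; shared memory has 16 banks on compute capability 1.x and 32 banks on 2.x and later):

```cuda
// Sketch: shared-memory bank conflicts. Shared memory is divided into banks of
// 4-byte words; when the threads of a warp hit distinct banks the access runs
// at full speed, but N threads hitting the same bank serialize into N accesses.
__global__ void bankDemo(float *out) {
    __shared__ float tile[32 * 32];
    int t = threadIdx.x;

    // Conflict-free: consecutive threads read consecutive words (distinct banks).
    float a = tile[t];

    // Heavily conflicted: with a stride of 32 words, threads 0..31 of a warp
    // all map to the same bank, so the reads serialize.
    float b = tile[t * 32];

    out[t] = a + b;
}
```

Broadcast is the exception: if every thread of a warp reads the *same* shared-memory word, the hardware broadcasts it without a conflict, which mirrors the constant-cache behavior the question describes.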

How is a CUDA kernel launched?

懵懂的女人 submitted on 2019-12-09 10:29:02
Question: I have created a simple CUDA application to add two matrices. It compiles fine. I want to know how the kernel is launched across all the threads, and what the flow is inside CUDA: in what fashion does each thread process an element of the matrices? I know this is a very basic concept, but I don't understand it and am confused about the flow. Answer 1: You launch a grid of blocks. Blocks are indivisibly assigned to multiprocessors (where the number of blocks on the multiprocessor …
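The flow the question asks about can be sketched as follows (kernel and variable names are assumed, not taken from the asker's code): every thread derives a unique (row, column) pair from its block and thread indices and handles exactly one matrix element.

```cuda
// Each thread computes one element of C = A + B. The grid is a 2-D array of
// blocks and each block a 2-D array of threads; blockIdx/threadIdx together
// give every thread its own coordinates in the matrix.
__global__ void matAdd(const float *A, const float *B, float *C, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column this thread owns
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row this thread owns
    if (x < W && y < H)                             // guard the ragged edge
        C[y * W + x] = A[y * W + x] + B[y * W + x];
}

// Host side: launch enough 16x16 blocks to cover the whole W x H matrix.
// dim3 block(16, 16);
// dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
// matAdd<<<grid, block>>>(dA, dB, dC, W, H);
```

There is no guaranteed execution order between blocks; the scheduler distributes them across multiprocessors as resources free up, which is why each thread must compute its own indices rather than rely on any ordering.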

How to debug OpenCL on Nvidia GPUs?

荒凉一梦 submitted on 2019-12-09 09:51:46
Question: Is there any way to debug OpenCL kernels on an Nvidia GPU, i.e. set breakpoints and inspect variables? My understanding is that Nvidia's tool does not allow OpenCL debugging, and AMD's and Intel's tools only work on their own devices. Answer 1: gDEBugger might help you somewhat (I have never used it, though), but other than that there isn't any tool I know of that can set breakpoints or inspect variables inside a kernel. Perhaps try to save intermediate outputs from your kernel if it is a long kernel. …
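The "save intermediate outputs" workaround in the answer usually means passing the kernel an extra scratch buffer and reading it back on the host, a sort of printf-by-buffer. A sketch (kernel and buffer names are made up for illustration):

```c
// OpenCL C kernel sketch: dump an intermediate value into a debug buffer,
// one slot per work-item, then read it back on the host with
// clEnqueueReadBuffer and inspect it like a log.
__kernel void work(__global const float *in,
                   __global float *out,
                   __global float *debug)   // hypothetical scratch buffer
{
    size_t i = get_global_id(0);
    float partial = in[i] * 2.0f;   // some intermediate value of interest
    debug[i] = partial;             // record it for host-side inspection
    out[i] = partial + 1.0f;
}
```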

How do I see the commands that are run by GNU make?

↘锁芯ラ submitted on 2019-12-08 17:08:12
Question: I'm trying to debug a complex Makefile. How do you get GNU make to print all the commands it runs? I couldn't find the answer in the man page (using the -d flag doesn't seem to print it). (This isn't necessary information to answer my question, but in case you're wondering: I'm having trouble compiling a project built on NVIDIA's CUDA library. I can compile it myself, but using their Makefile results in a nasty compiler error. I'd like to use their provided Makefile for easier packaging, and …
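A few GNU make switches cover this; a quick sketch with a throwaway Makefile (the file path and recipe below are just examples):

```shell
# A dry run (-n / --just-print) prints every recipe line make would execute,
# without running it, including lines silenced with a leading @.
printf 'all:\n\t@echo nvcc -o demo demo.cu\n' > /tmp/demo.mk
make -n -f /tmp/demo.mk

# Other useful switches:
#   make --trace ...   (GNU make >= 4.0) runs the build and prints each recipe
#                      line with the Makefile location it came from
#   make -p            dumps make's internal database of rules and variables
#   make V=1           only works if the Makefile itself honors a V/VERBOSE
#                      convention, as many large projects do
```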

Entry function uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max) - CUDA error

旧街凉风 submitted on 2019-12-08 12:18:47
Question: I am using a Tesla C2050, which has compute capability 2.0 and 48 KB of shared memory. But when I try to use this shared memory, the nvcc compiler gives me the following error: Entry function '_Z4SAT3PhPdii' uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max). My SAT1 is the naive implementation of the scan algorithm, and because I am operating on images of size 4096x2160 I have to use double to compute the cumulative sum. Though the Tesla C2050 does not support double …

OpenCV - Copy GpuMat into cuda device data

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-08 10:07:05
Question: I am trying to copy the data in a cv::cuda::GpuMat to a uint8_t* variable that is to be used in a kernel. The GpuMat contains image data of resolution 752x480 and type CV_8UC1. Below is the sample code: uint8_t *imgPtr; cv::Mat left, downloadedLeft; cv::cuda::GpuMat gpuLeft; left = imread("leftview.jpg", cv::IMREAD_GRAYSCALE); gpuLeft.upload(left); cudaMalloc((void **)&imgPtr, sizeof(uint8_t)*gpuLeft.rows*gpuLeft.cols); cudaMemcpyAsync(imgPtr, gpuLeft.ptr<uint8_t>(), sizeof(uint8_t) …
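The usual pitfall with code like this is that GpuMat rows are padded for alignment, so `step` (the row stride in bytes) is generally larger than `cols`; a flat copy of `rows * cols` bytes would interleave pixel data with padding. `cudaMemcpy2D` handles the pitch. A sketch reusing the question's variable names:

```cuda
// Pitched device-to-device copy out of a GpuMat into a tightly packed buffer.
// GpuMat::step is the padded row stride in bytes; cudaMemcpy2D copies
// cols bytes from each of rows rows, skipping the padding.
uint8_t *imgPtr;
cudaMalloc((void **)&imgPtr, sizeof(uint8_t) * gpuLeft.rows * gpuLeft.cols);
cudaMemcpy2D(imgPtr,                          // dst
             gpuLeft.cols * sizeof(uint8_t),  // dst pitch: tightly packed
             gpuLeft.ptr<uint8_t>(),          // src
             gpuLeft.step,                    // src pitch: padded row stride
             gpuLeft.cols * sizeof(uint8_t),  // width of each row in bytes
             gpuLeft.rows,                    // number of rows
             cudaMemcpyDeviceToDevice);
```

Alternatively, many kernels can consume the GpuMat's own storage directly by taking `gpuLeft.ptr<uint8_t>()` plus `gpuLeft.step` as arguments, avoiding the copy entirely.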

Inter-block synchronization in CUDA

心已入冬 submitted on 2019-12-08 09:36:49
Question: I've spent a month on this problem: I cannot synchronize blocks in CUDA. I've read a lot of posts about atomicAdd, cooperative groups, etc. I decided to use a global array so that each block could write to one element of it. After this write, a thread of the block waits (i.e. spins in a while loop) until all blocks have written to the global array. When I used 3 blocks my synchronization worked well (because I have 3 SMs). But using 3 blocks gives me 12% occupancy, so I need to use more blocks, but …
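The spin-on-a-global-array scheme deadlocks precisely when there are more blocks than can be resident at once, which is why it breaks as soon as the block count grows. The supported alternative is a cooperative-groups grid sync (compute capability 6.0+ and a launch via `cudaLaunchCooperativeKernel`, so it may not apply to the asker's GPU); a sketch with assumed names:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Grid-wide barrier via cooperative groups. Unlike a hand-rolled spin on a
// global array, this is safe: the cooperative launch API refuses to launch
// more blocks than can be co-resident, so no block can be starved at the
// barrier waiting for one that was never scheduled.
__global__ void twoPhase(float *data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) data[i] *= 2.0f;   // phase 1
    grid.sync();                  // all blocks reach here before any continues
    if (i < n) data[i] += 1.0f;   // phase 2 sees every phase-1 write
}
```

On hardware without cooperative launch, the robust fallback is to split the work into two kernels and let the kernel boundary act as the grid-wide barrier.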

GeForce Experience share feature generates whitelist errors and slows performance

白昼怎懂夜的黑 submitted on 2019-12-08 07:56:36
Question: I'm developing an application that previews video feeds from a capture card and/or a webcam. I've noticed a lot of errors in my console that look like: IGIESW [path to my.exe] found in whitelist: NO IGIWHW Game [path to my.exe] found in whitelist: NO These repeat each time I try to activate a preview window or switch the source feed I'm trying to preview. Each occurrence actually takes a few seconds and it really kills the responsiveness of my application. I'm also seeing a similar …

Compute Prof's fields for incoherent and coherent gst/gld? (CUDA/OpenCL)

心已入冬 submitted on 2019-12-08 07:31:19
Question: I am using Compute Prof 3.2 and a GeForce GTX 280, so I believe I have compute capability 1.3. This file seems to show that I should be able to see these fields, since I am using a 1.x compute device. Well, I don't see them, and the User Guide for the 3.2 toolkit says I can't see them, but calls them gst_uncoalesced and gst_coalesced. To sum up, I am confused about how to figure out from the profiler whether I am making non-coalesced reads from global memory. It doesn't look like Fermi cards will say either, but I am not worried about them for now. If anybody can elaborate on the situation, I …

ubuntu 16.04 LTS login loop after updating driver nvidia-396

穿精又带淫゛_ submitted on 2019-12-08 07:02:08
Question: I have an issue logging in to my computer when nvidia-396 is installed: it returns to the login screen after showing an error message pop-up. When I remove nvidia* and restart lightdm, it works fine. Could you please help me fix this? Thanks. Answer 1: I had the same issue with this driver. My system is: Nvidia GTX 1060 (6 GB), AMD FX 8350, ASUS motherboard. I was using the 390 driver (394.48), then upgraded to 396 and got this 'lightdm<->nvidia driver' problem. It seems that mostly users are getting …