nvvp


Profiling arbitrary CUDA applications

試著忘記壹切 submitted on 2019-12-22 12:47:07
Question: I know of the existence of nvvp and nvprof, of course, but for various reasons nvprof does not want to work with my app, which involves lots of shared libraries. nvidia-smi can hook into the driver to find out what's running, but I cannot find a nice way to get nvprof to attach to a running process. There is a flag, --profile-all-processes, which does give me the message "NVPROF is profiling process 12345", but nothing further prints out. I am using CUDA 8. How can I get a detailed

How to observe CUDA events and metrics for a subsection of an executable (e.g. only during a kernel execution time)?

假如想象 submitted on 2019-12-19 09:03:22
Question: I'm familiar with using nvprof to access the events and metrics of a benchmark, e.g.,
nvprof --system-profiling on --print-gpu-trace -o (file name) --events inst_issued1 ./benchmarkname
The --system-profiling on --print-gpu-trace -o (file name) options give timestamps for kernel start and end times, plus power and temperature, and save the info to an nvvp file so we can view it in the Visual Profiler. This allows us to see what's happening in any section of the code, in particular when a specific kernel is

CUDA Visual profiler over a remote X session

南笙酒味 submitted on 2019-12-13 04:29:02
Question: I am running an Ubuntu 11.10 server with CUDA 5.0 and a GTX 480. I am trying to run the Visual Profiler remotely by using Xming and Cygwin/X on Windows 8. I can successfully run xclock, but when I try to launch /usr/local/cuda-5.0/bin/nvvp from the PuTTY command line, it silently exits without any errors or warnings. I installed the default config of Cygwin/X with the xorg-server, xinit and openssh packages. Do I need any more packages? I do not want to use the command line profiler as I

nvvp and nsight's profiler give a different result?

六月ゝ 毕业季﹏ submitted on 2019-12-13 03:44:53
Question: I want to try the gst_inst_128bit instruction. For the same program, nvvp reports a lot of gst_inst_128bit instructions executed, while Nsight's profiler reports four times as many gst_inst_32bit instructions instead. It is the same program in both cases. How can this happen? The experiment was run on Linux with CUDA 5.0 and a GTX 580. The program only copies data from one array to another in a kernel function. In main: cudaMalloc((void**)&dev_a, NUM * sizeof(float)); cudaMalloc((void**)&dev_b, NUM * sizeof(float));

Cuda zero-copy performance

大憨熊 submitted on 2019-12-02 06:52:17
Question: Does anyone have experience with analyzing the performance of CUDA applications utilizing the zero-copy (reference here: Default Pinned Memory Vs Zero-Copy Memory) memory model? I have a kernel that uses the zero-copy feature and with NVVP I see the following: Running the kernel on an average problem size I get instruction replay overhead of 0.7%, so nothing major. And all of this 0.7% is global memory replay overhead. When I really jack up the problem size, I get an instruction replay

Cuda zero-copy performance

狂风中的少年 submitted on 2019-12-02 04:00:00
Does anyone have experience with analyzing the performance of CUDA applications utilizing the zero-copy (reference here: Default Pinned Memory Vs Zero-Copy Memory) memory model? I have a kernel that uses the zero-copy feature and with NVVP I see the following: Running the kernel on an average problem size I get instruction replay overhead of 0.7%, so nothing major. And all of this 0.7% is global memory replay overhead. When I really jack up the problem size, I get an instruction replay overhead of 95.7%, all of which is due to global memory replay overhead. However, the global load efficiency

How to observe CUDA events and metrics for a subsection of an executable (e.g. only during a kernel execution time)?

社会主义新天地 submitted on 2019-12-01 07:39:53
I'm familiar with using nvprof to access the events and metrics of a benchmark, e.g.,
nvprof --system-profiling on --print-gpu-trace -o (file name) --events inst_issued1 ./benchmarkname
The --system-profiling on --print-gpu-trace -o (file name) options give timestamps for kernel start and end times, plus power and temperature, and save the info to an nvvp file so we can view it in the Visual Profiler. This allows us to see what's happening in any section of the code, in particular when a specific kernel is running. My question is this: is there a way to isolate the events counted for only a section of the
