Nsight

Include a static CUDA library into a C++ project

荒凉一梦 submitted on 2019-12-01 14:49:31
I have a templated static CUDA library that I want to include in a plain C++ project. When I include the library's headers, the compilation fails because the compiler says it cannot resolve the CUDA-specific symbols. Of course g++ cannot interpret these symbols. I understand the problem, but I do not know how to fix it within the Nsight IDE. I am using Nsight for both the CUDA/nvcc library and the C++/g++ project. Console output: make all Building file: ../src/MedPrak.cpp Invoking: GCC C++ Compiler g++ -I/home/voodoocode/Praktikum/MedPrak/PrivateRepo/MedPrakCuda/src -O0 -g3 -Wall -c
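One common way to structure this (a sketch, not the asker's actual setup; the file names, function, and paths below are hypothetical) is to keep all CUDA syntax out of the headers that g++ sees: the header exposes plain C++ declarations, the .cu file compiled by nvcc implements the kernels and wrapper functions and goes into the static library, and the g++ project links against that library plus cudart. For a templated library, the templates would additionally need explicit instantiation inside the .cu file so the host compiler only ever sees declarations.

```cpp
// ---- medprak_cuda.h (hypothetical name) --------------------------------
// No CUDA syntax here, so g++ can include it without seeing
// <<<...>>>, __global__, etc.
#pragma once
void scaleOnGpu(float* hostData, int n, float factor);

// ---- medprak_cuda.cu (compiled by nvcc into libmedprakcuda.a) ----------
#include <cuda_runtime.h>
#include "medprak_cuda.h"

__global__ void scaleKernel(float* d, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= factor;
}

// Host-callable wrapper: all CUDA-specific code stays in this .cu file.
void scaleOnGpu(float* hostData, int n, float factor) {
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    scaleKernel<<<(n + 255) / 256, 256>>>(dev, n, factor);
    cudaMemcpy(hostData, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}

// ---- linking the g++ project (example command line) --------------------
// g++ MedPrak.cpp -I<path-to-headers> -L<path-to-lib> -lmedprakcuda \
//     -L/usr/local/cuda/lib64 -lcudart
```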

NVIDIA Parallel Nsight vs. Visual Profiler

醉酒当歌 submitted on 2019-12-01 01:23:04
I am working with CUDA on the Windows platform. On Windows we have access to both Parallel Nsight and the Visual Profiler. Both are quite good, but they have very similar features for profiling and tracing. Can someone tell me how they differ and which one is better for the Windows platform? I will mainly need a tool for profiling. Answer 1: Nsight Visual Studio Edition 2.2 offers the following advantages over the Visual Profiler: OVERALL Integration into Visual Studio 2008 SP1 and 2010 (requires Professional Edition as VS Express Edition does not support

Time between Kernel Launch and Kernel Execution

让人想犯罪 __ submitted on 2019-11-30 07:55:33
I'm trying to optimize my CUDA program using the Parallel Nsight 2.1 edition for VS 2010. My program runs on a Windows 7 (32 bit) machine with a GTX 480 board. I have installed the CUDA 4.1 32-bit toolkit and the 301.32 driver. One cycle in the program consists of a copy of host data to the device, execution of the kernels, and a copy of the results from the device back to the host. As you can see in the picture of the profiler results below, the kernels run in four different streams. The kernels in each stream rely on the data copied to the device in 'Stream 2'. That's why the asyncMemcpy is
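For reference, here is a minimal sketch (with hypothetical names, not the asker's code) of the dependency pattern described above: an asynchronous host-to-device copy issued in one stream, with the kernels in the four other streams made to wait on it via an event.

```cpp
#include <cuda_runtime.h>

// Forward declaration of some worker kernel (illustrative).
__global__ void work(const float* d_in, float* d_out, int n);

void runCycle(const float* h_in, float* d_in, float* d_out, int n,
              cudaStream_t copyStream, cudaStream_t kernelStreams[4]) {
    cudaEvent_t copyDone;
    cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming);

    // Asynchronous copy; h_in must be pinned (cudaHostAlloc/cudaHostRegister)
    // or the copy silently falls back to a synchronous path.
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);
    cudaEventRecord(copyDone, copyStream);

    for (int s = 0; s < 4; ++s) {
        // Each kernel stream waits until the copy has completed.
        cudaStreamWaitEvent(kernelStreams[s], copyDone, 0);
        work<<<(n + 255) / 256, 256, 0, kernelStreams[s]>>>(d_in, d_out, n);
    }
    cudaEventDestroy(copyDone);
}
```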

CUDA: Nsight VS2010 profile __device__ function

匆匆过客 submitted on 2019-11-29 15:57:50
I would like to know how to profile a __device__ function that is inside a __global__ function with Nsight 2.2 in Visual Studio 2010. I need to know which function is consuming the most resources and time. I have CUDA 5.0 on a CC 2.0 device. Answer 1: The Nsight Visual Studio Edition 3.0 CUDA Profiler introduces source-correlated experiments. The Profile CUDA Activity supports the following source-level experiments: Instruction Count - collects instructions executed, thread instructions executed, active thread histogram, and predicated thread histogram for every user instruction in the kernel. Information on syscalls
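As an illustration of what such a source-correlated profile needs from the code side (a sketch under stated assumptions, not the asker's code): building with nvcc's -lineinfo flag provides the source correlation, and marking the __device__ function __noinline__ keeps it from being folded into the caller, so its counts show up separately in the source view.

```cpp
// Assumed build line: nvcc -lineinfo -arch=sm_20 kernel.cu
// (sm_20 matches the CC 2.0 device mentioned in the question.)

// __noinline__ prevents the device function from being inlined,
// so the profiler can attribute instructions/time to it directly.
__device__ __noinline__ float heavyMath(float x) {
    float acc = 0.0f;
    for (int i = 0; i < 64; ++i)
        acc += __sinf(x + i) * __cosf(x - i);
    return acc;
}

__global__ void kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = heavyMath(in[i]);
}
```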

Why do I get an "Unspecified Launch failure" in a CUDA program multiplying 2 matrices

余生颓废 submitted on 2019-11-28 14:25:24
I am new to CUDA. When I multiply a 1024x1024 matrix and launch the kernel with: multiplyKernel<<<dim3(32, 32, 1), dim3(32, 32, 1)>>>(dev_c, dev_a, dev_b, size); it works. But when I multiply a 2048 x 2048 matrix, with dim3(64, 64, 1), I get this error: cudaDeviceSynchronize returned error code 4 after launching addKernel! unspecified launch failure From tinkering with the code, I think that the error is in this statement: result += a[row * size + ind] * b[col + size * ind]; in the part b[col + size * ind]. If I take that out, I don't get a kernel launch error (just the wrong answer, obviously). I cannot
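Without the full code it is hard to say whether this is an out-of-bounds access or something else (on Windows, a long-running kernel can also be killed by the display-driver watchdog). Below is a minimal sketch of a bounds-checked version of such a kernel with explicit error checking around the launch, which helps narrow this down; the names mirror the snippet above, but the code is illustrative, not the asker's.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void multiplyKernel(float* c, const float* a, const float* b, int size) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= size || col >= size) return;   // guard against stray threads
    float result = 0.0f;
    for (int ind = 0; ind < size; ++ind)
        result += a[row * size + ind] * b[col + size * ind];
    c[row * size + col] = result;
}

void launchMultiply(float* dev_c, const float* dev_a, const float* dev_b, int size) {
    dim3 block(32, 32, 1);
    dim3 grid((size + 31) / 32, (size + 31) / 32, 1);
    multiplyKernel<<<grid, block>>>(dev_c, dev_a, dev_b, size);

    cudaError_t err = cudaGetLastError();                   // launch-configuration errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // errors during execution
    if (err != cudaSuccess)
        fprintf(stderr, "multiplyKernel failed: %s\n", cudaGetErrorString(err));
}
```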

slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice

江枫思渺然 submitted on 2019-11-28 02:17:07
I understand that CUDA does its initialization during the first API call, but the time spent is just too much, even after a separate cudaSetDevice. The test program: the same binary, built with CUDA 7.0 (compute_35) + Visual Studio 2012 + NSight 4.5, run on 2 separate machines (no rebuilding). Before the 1st cudaMalloc I've called cudaSetDevice. On my PC (Win7 + Tesla K20), the 1st cudaMalloc takes 150ms; on my server (Win2012 + Tesla K40), it takes 1100ms!! For both machines, subsequent cudaMalloc calls are much faster. My questions are: 1. Why does the K40 take so much longer (1100ms vs 150ms) for the 1st
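As context for the timing: the CUDA runtime creates its context lazily, so cudaSetDevice alone does not necessarily pay the initialization cost; the first call that actually needs a context (here the first cudaMalloc) does. A small sketch of the usual workaround, forcing initialization with a throwaway call so the first real allocation is no longer charged for it (the timing code and allocation size are illustrative):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    using clk = std::chrono::steady_clock;

    cudaSetDevice(0);            // selects the device, but by itself does not
                                 // necessarily build the context

    auto t0 = clk::now();
    cudaFree(0);                 // common idiom: triggers lazy context creation
    auto t1 = clk::now();

    void* p = nullptr;
    cudaMalloc(&p, 1 << 20);     // should now be fast, since init already happened
    auto t2 = clk::now();
    cudaFree(p);

    auto ms = [](clk::time_point a, clk::time_point b) {
        return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    printf("context init: %lld ms, first cudaMalloc: %lld ms\n", ms(t0, t1), ms(t1, t2));
    return 0;
}
```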
