nvidia

Is it possible to manually set the SMs used for one CUDA stream?

强颜欢笑 submitted on 2021-02-05 10:51:14
Question: By default, a kernel will use all available SMs of the device (given enough blocks). However, I now have two streams, one compute-intensive and one memory-intensive, and I want to limit the maximum number of SMs each stream may use (after setting the cap, a kernel in a stream would use at most that many SMs, e.g. 20 SMs for the compute-intensive stream and 4 SMs for the memory-intensive one). Is it possible to do so, and if so, which API should I use? Answer 1: In short, no, there is no way to do what…
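There is no public CUDA API that pins a kernel or a stream to a particular subset of SMs. The closest documented knob is stream priorities, which merely bias the block scheduler toward one stream rather than partitioning the SMs. A minimal sketch, assuming the two kernels for the two workloads already exist:

#include "cuda_runtime.h"

int main() {
    // Query the priority range supported by this device; in CUDA,
    // numerically lower values mean higher priority.
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    // Bias the scheduler toward the compute-heavy stream. This does NOT
    // reserve SMs for either stream; blocks still mix across the device.
    cudaStream_t computeStream, memoryStream;
    cudaStreamCreateWithPriority(&computeStream, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&memoryStream, cudaStreamNonBlocking, least);

    // ... launch the compute-intensive kernel into computeStream and the
    // memory-intensive kernel into memoryStream here ...

    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(memoryStream);
    return 0;
}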

torch.cuda.is_available returns False with nvidia-smi not working

淺唱寂寞╮ submitted on 2021-02-05 10:43:33
Question: I'm trying to build a Docker image that can run using GPUs. This is my situation: I have Python 3.6 and I am starting from the image nvidia/cuda:10.0-cudnn7-devel. Torch does not see my GPUs, and nvidia-smi is not working either, returning the error: > Failed to initialize NVML: Unknown Error > The command '/bin/sh -c nvidia-smi' returned a non-zero code: 255 I installed the NVIDIA toolkit and nvidia-smi with RUN apt install nvidia-cuda-toolkit -y RUN apt-get install nvidia-utils-410 -y Answer 1: I figured out the…

CUDA 11 kernel doesn't run

僤鯓⒐⒋嵵緔 submitted on 2021-02-05 09:10:30
Question: Here is a demo.cu aiming to printf from the GPU device: #include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> __global__ void hello_cuda() { printf("hello from GPU\n"); } int main() { printf("hello from CPU\n"); hello_cuda <<<1, 1>>> (); cudaDeviceSynchronize(); cudaDeviceReset(); printf("bye bye from CPU\n"); return 0; } It compiles and runs: $ nvcc demo.cu $ ./a.out This is the output I get: hello from CPU bye bye from CPU Q: why is there no output from the GPU…
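A common cause on CUDA 11 is that nvcc produced no kernel image for the installed GPU's architecture, so the launch fails silently: nothing in the demo checks for errors. A minimal sketch of the same program with error checking added (compiling with an architecture flag matching your GPU, e.g. nvcc -arch=sm_75 demo.cu, is the typical fix):

#include <cstdio>
#include "cuda_runtime.h"

__global__ void hello_cuda() { printf("hello from GPU\n"); }

int main() {
    printf("hello from CPU\n");
    hello_cuda<<<1, 1>>>();

    // A failed launch (e.g. "no kernel image is available") surfaces here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));

    // Errors during kernel execution surface at synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) printf("sync error: %s\n", cudaGetErrorString(err));

    printf("bye bye from CPU\n");
    return 0;
}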

Structure in Texture memory on CUDA

痞子三分冷 submitted on 2021-02-04 08:28:25
Question: I have an array containing a structure of two elements that I send to CUDA in global memory, and I read the values back from global memory. Having read through some books and posts, and since I only read values from the structure, I thought it would be interesting to see whether the array could be stored in texture memory instead. I used the following code outside the kernel: texture<node, cudaTextureType1D, cudaReadModeElementType> textureNode; and the following lines in main(): gpuErrchk(cudaMemcpy(tree_d,…
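CUDA textures can only be bound to the built-in 1-, 2- and 4-component vector types, so an arbitrary struct node cannot be used as the texel type directly. Assuming the structure's two members are floats, packing each node into a float2 works; a minimal sketch using the legacy texture-reference API from the question (deprecated in recent CUDA releases in favor of cudaTextureObject_t):

#include "cuda_runtime.h"

// Legacy texture reference over float2; each texel packs one two-float node.
texture<float2, cudaTextureType1D, cudaReadModeElementType> textureNode;

__global__ void readNodes(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 node = tex1Dfetch(textureNode, i);  // cached texture read
        out[i] = node.x + node.y;
    }
}

int main() {
    const int n = 256;
    float2 *tree_d;
    float *out_d;
    cudaMalloc(&tree_d, n * sizeof(float2));
    cudaMalloc(&out_d, n * sizeof(float));
    // ... cudaMemcpy host data into tree_d ...
    cudaBindTexture(nullptr, textureNode, tree_d, n * sizeof(float2));
    readNodes<<<(n + 127) / 128, 128>>>(out_d, n);
    cudaDeviceSynchronize();
    cudaUnbindTexture(textureNode);
    cudaFree(tree_d);
    cudaFree(out_d);
    return 0;
}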

No GPU activities in profiling with nvprof

Deadly submitted on 2021-01-29 09:57:04
Question: I ran nvprof.exe on a program that initializes data, calls three kernels and frees the data. Everything was profiled as it should be, and I got a result like this: ==7956== Profiling application: .\a.exe ==7956== Profiling result: GPU activities: 52.34% 25.375us 1 25.375us 25.375us 25.375us th_single_row_add(float*, float*, float*) 43.57% 21.120us 1 21.120us 21.120us 21.120us th_single_col_add(float*, float*, float*) 4.09% 1.9840us 1 1.9840us 1.9840us 1.9840us th_single_elem_add(float*, float*, float*) API…
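When nvprof instead shows an empty "GPU activities" section, the usual suspects are kernels that never actually launched (unchecked launch errors) or work happening outside the profiled window. One way to rule out the latter is to profile only an explicitly marked region; a minimal sketch (kernel launch elided, names taken from the question):

#include "cuda_runtime.h"
#include "cuda_profiler_api.h"

int main() {
    // ... allocate device memory and initialize data ...
    cudaProfilerStart();              // begin the region nvprof records
    // th_single_elem_add<<<grid, block>>>(a, b, c);  // kernels under test
    cudaDeviceSynchronize();          // make sure the work has finished
    cudaProfilerStop();               // end the recorded region
    return 0;
}

Run with nvprof --profile-from-start off .\a.exe so that only the marked region is captured.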

Process strings from OpenCL kernel

喜欢而已 submitted on 2021-01-29 07:22:46
Question: There are several strings, such as std::string first, second, third; ... My plan was to collect their addresses into a char* array: char *addresses[] = {&first[0], &second[0], &third[0]}; ... and pass the char **addresses to the OpenCL kernel. There are several problems and questions: The main issue is that I cannot pass an array of pointers. Is there any good way to use many strings from the kernel code without copying them, leaving them in shared memory? I'm using NVIDIA on Windows. So, I…
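An OpenCL buffer cannot carry host pointers, so a char** of std::string addresses will not survive the trip to the device. The standard workaround is to flatten all strings into one contiguous character buffer plus an offsets array, and pass those two buffers to the kernel. A minimal host-side sketch (string contents hypothetical):

#include <cstdint>
#include <string>
#include <vector>

int main() {
    std::string first = "alpha", second = "beta", third = "gamma";
    const std::string *strings[] = {&first, &second, &third};

    // Flatten: one char buffer, plus the start offset of each string.
    // offsets gets one extra entry so the kernel can compute each length.
    std::vector<char> flat;
    std::vector<int32_t> offsets;
    for (const std::string *s : strings) {
        offsets.push_back(static_cast<int32_t>(flat.size()));
        flat.insert(flat.end(), s->begin(), s->end());
    }
    offsets.push_back(static_cast<int32_t>(flat.size()));

    // flat.data() and offsets.data() are then copied into two cl_mem
    // buffers (e.g. clCreateBuffer with CL_MEM_COPY_HOST_PTR), and the
    // kernel reads string i as flat[offsets[i]] .. flat[offsets[i+1]].
    return 0;
}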

Calculate the angle between two triangles in CUDA

我的未来我决定 submitted on 2021-01-29 06:08:49
Question: I want to calculate the angle between two triangles in 3D space. The two triangles will always share exactly two points, e.g. Triangle 1: Point1 (x1, y1, z1), Point2 (x2, y2, z2), Point3 (x3, y3, z3); Triangle 2: Point1 (x1, y1, z1), Point2 (x2, y2, z2), Point4 (x4, y4, z4). Is there a way to calculate the angle between them efficiently in CUDA? Answer 1: For each plane, you need to construct its normal vector (perpendicular to all lines in that plane). The simple way to do that is to take the…
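Following the answer's approach, each triangle's normal is the cross product of two edge vectors, and the angle comes from the normalized dot product of the two normals. A minimal device-side sketch (helper names are illustrative, not a library API):

#include <cstdio>
#include "cuda_runtime.h"

__device__ float3 sub3(float3 a, float3 b) { return make_float3(a.x - b.x, a.y - b.y, a.z - b.z); }
__device__ float dot3(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
__device__ float3 cross3(float3 a, float3 b) {
    return make_float3(a.y * b.z - a.z * b.y,
                       a.z * b.x - a.x * b.z,
                       a.x * b.y - a.y * b.x);
}

// Angle between the planes of triangles (p1,p2,p3) and (p1,p2,p4),
// which share the edge p1-p2.
__global__ void triangleAngle(float3 p1, float3 p2, float3 p3, float3 p4, float *angle) {
    float3 n1 = cross3(sub3(p2, p1), sub3(p3, p1));  // normal of triangle 1
    float3 n2 = cross3(sub3(p2, p1), sub3(p4, p1));  // normal of triangle 2
    float c = dot3(n1, n2) * rsqrtf(dot3(n1, n1) * dot3(n2, n2));
    *angle = acosf(fminf(fmaxf(c, -1.0f), 1.0f));    // clamp for numerical safety
}

int main() {
    float *angle_d;
    cudaMalloc(&angle_d, sizeof(float));
    triangleAngle<<<1, 1>>>(make_float3(0, 0, 0), make_float3(1, 0, 0),
                            make_float3(0, 1, 0), make_float3(0, 0, 1), angle_d);
    float angle;
    cudaMemcpy(&angle, angle_d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("angle = %f rad\n", angle);  // expect pi/2 for these points
    cudaFree(angle_d);
    return 0;
}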

PyTorch CUDA error: no kernel image is available for execution on the device on RTX 3090 with CUDA 11.1

岁酱吖の submitted on 2021-01-28 01:50:34
Question: If I run the following: import torch import sys print('A', sys.version) print('B', torch.__version__) print('C', torch.cuda.is_available()) print('D', torch.backends.cudnn.enabled) device = torch.device('cuda') print('E', torch.cuda.get_device_properties(device)) print('F', torch.tensor([1.0, 2.0]).cuda()) I get this: A 3.7.5 (default, Nov 7 2019, 10:50:52) [GCC 8.3.0] B 1.8.0.dev20210115+cu110 C True D True E _CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory…