nvidia

Is it possible to manually set the SMs used for one CUDA stream?

强颜欢笑 submitted on 2021-02-05 10:51:14
Question: By default, a kernel will use all available SMs of the device (given enough blocks). However, I now have two streams, one compute-intensive and one memory-intensive, and I want to limit the maximum number of SMs each stream may use (after setting the cap, a kernel in a stream would use at most that many SMs, e.g. 20 SMs for the compute-intensive stream and 4 SMs for the memory-intensive one). Is it possible to do so, and if so, which API should I use? Answer 1: In short, no, there is no way to do what…
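There is no public CUDA API that pins a kernel or a stream to a particular subset of SMs. The closest documented knob is stream priorities, which merely bias the block scheduler toward one stream rather than partitioning the SMs. A minimal sketch, assuming the two kernels for the two workloads already exist:

#include "cuda_runtime.h"

int main() {
    // Query the priority range supported by this device; in CUDA,
    // numerically lower values mean higher priority.
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    // Bias the scheduler toward the compute-heavy stream. This does NOT
    // reserve SMs for either stream; blocks still mix across the device.
    cudaStream_t computeStream, memoryStream;
    cudaStreamCreateWithPriority(&computeStream, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&memoryStream, cudaStreamNonBlocking, least);

    // ... launch the compute-intensive kernel into computeStream and the
    // memory-intensive kernel into memoryStream here ...

    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(memoryStream);
    return 0;
}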

torch.cuda.is_available returns False with nvidia-smi not working

淺唱寂寞╮ submitted on 2021-02-05 10:43:33
Question: I'm trying to build a Docker image that can run using GPUs. This is my situation: I have Python 3.6 and I am starting from the image nvidia/cuda:10.0-cudnn7-devel. Torch does not see my GPUs, and nvidia-smi is not working either, returning the error: > Failed to initialize NVML: Unknown Error > The command '/bin/sh -c nvidia-smi' returned a non-zero code: 255 I installed the NVIDIA toolkit and nvidia-smi with RUN apt install nvidia-cuda-toolkit -y RUN apt-get install nvidia-utils-410 -y Answer 1: I figured out the…

CUDA 11 kernel doesn't run

僤鯓⒐⒋嵵緔 submitted on 2021-02-05 09:10:30
Question: Here is a demo.cu aiming to printf from the GPU device: #include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> __global__ void hello_cuda() { printf("hello from GPU\n"); } int main() { printf("hello from CPU\n"); hello_cuda <<<1, 1>>> (); cudaDeviceSynchronize(); cudaDeviceReset(); printf("bye bye from CPU\n"); return 0; } It compiles and runs: $ nvcc demo.cu $ ./a.out This is the output I get: hello from CPU bye bye from CPU Q: why is there no output from the GPU…
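A common cause on CUDA 11 is that nvcc produced no kernel image for the installed GPU's architecture, so the launch fails silently: nothing in the demo checks for errors. A minimal sketch of the same program with error checking added (compiling with an architecture flag matching your GPU, e.g. nvcc -arch=sm_75 demo.cu, is the typical fix):

#include <cstdio>
#include "cuda_runtime.h"

__global__ void hello_cuda() { printf("hello from GPU\n"); }

int main() {
    printf("hello from CPU\n");
    hello_cuda<<<1, 1>>>();

    // A failed launch (e.g. "no kernel image is available") surfaces here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));

    // Errors during kernel execution surface at synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) printf("sync error: %s\n", cudaGetErrorString(err));

    printf("bye bye from CPU\n");
    return 0;
}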

Structure in Texture memory on CUDA

痞子三分冷 submitted on 2021-02-04 08:28:25
Question: I have an array containing a structure of two elements that I send to CUDA in global memory, and I read the values back from global memory. Having read through some books and posts, and since I only read values from the structure, I thought it would be interesting to see whether the array could be stored in texture memory instead. I used the following code outside the kernel: texture<node, cudaTextureType1D, cudaReadModeElementType> textureNode; and the following lines in main(): gpuErrchk(cudaMemcpy(tree_d,…
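CUDA textures can only be bound to the built-in 1-, 2- and 4-component vector types, so an arbitrary struct node cannot be used as the texel type directly. Assuming the structure's two members are floats, packing each node into a float2 works; a minimal sketch using the legacy texture-reference API from the question (deprecated in recent CUDA releases in favor of cudaTextureObject_t):

#include "cuda_runtime.h"

// Legacy texture reference over float2; each texel packs one two-float node.
texture<float2, cudaTextureType1D, cudaReadModeElementType> textureNode;

__global__ void readNodes(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 node = tex1Dfetch(textureNode, i);  // cached texture read
        out[i] = node.x + node.y;
    }
}

int main() {
    const int n = 256;
    float2 *tree_d;
    float *out_d;
    cudaMalloc(&tree_d, n * sizeof(float2));
    cudaMalloc(&out_d, n * sizeof(float));
    // ... cudaMemcpy host data into tree_d ...
    cudaBindTexture(nullptr, textureNode, tree_d, n * sizeof(float2));
    readNodes<<<(n + 127) / 128, 128>>>(out_d, n);
    cudaDeviceSynchronize();
    cudaUnbindTexture(textureNode);
    cudaFree(tree_d);
    cudaFree(out_d);
    return 0;
}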

No GPU activities in profiling with nvprof

Deadly submitted on 2021-01-29 09:57:04
Question: I ran nvprof.exe on a program that initializes data, calls three kernels and frees the data. Everything was profiled as it should be, and I got a result like this: ==7956== Profiling application: .\a.exe ==7956== Profiling result: GPU activities: 52.34% 25.375us 1 25.375us 25.375us 25.375us th_single_row_add(float*, float*, float*) 43.57% 21.120us 1 21.120us 21.120us 21.120us th_single_col_add(float*, float*, float*) 4.09% 1.9840us 1 1.9840us 1.9840us 1.9840us th_single_elem_add(float*, float*, float*) API…
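When nvprof instead shows an empty "GPU activities" section, the usual suspects are kernels that never actually launched (unchecked launch errors) or work happening outside the profiled window. One way to rule out the latter is to profile only an explicitly marked region; a minimal sketch (kernel launch elided, names taken from the question):

#include "cuda_runtime.h"
#include "cuda_profiler_api.h"

int main() {
    // ... allocate device memory and initialize data ...
    cudaProfilerStart();              // begin the region nvprof records
    // th_single_elem_add<<<grid, block>>>(a, b, c);  // kernels under test
    cudaDeviceSynchronize();          // make sure the work has finished
    cudaProfilerStop();               // end the recorded region
    return 0;
}

Run with nvprof --profile-from-start off .\a.exe so that only the marked region is captured.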

Process strings from OpenCL kernel

喜欢而已 submitted on 2021-01-29 07:22:46
Question: There are several strings, such as std::string first, second, third; ... My plan was to collect their addresses into a char* array: char *addresses[] = {&first[0], &second[0], &third[0]}; ... and pass the char **addresses to the OpenCL kernel. There are several problems and questions: The main issue is that I cannot pass an array of pointers. Is there any good way to use many strings from the kernel code without copying them, leaving them in shared memory? I'm using NVIDIA on Windows. So, I…
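An OpenCL buffer cannot carry host pointers, so a char** of std::string addresses will not survive the trip to the device. The standard workaround is to flatten all strings into one contiguous character buffer plus an offsets array, and pass those two buffers to the kernel. A minimal host-side sketch (string contents hypothetical):

#include <cstdint>
#include <string>
#include <vector>

int main() {
    std::string first = "alpha", second = "beta", third = "gamma";
    const std::string *strings[] = {&first, &second, &third};

    // Flatten: one char buffer, plus the start offset of each string.
    // offsets gets one extra entry so the kernel can compute each length.
    std::vector<char> flat;
    std::vector<int32_t> offsets;
    for (const std::string *s : strings) {
        offsets.push_back(static_cast<int32_t>(flat.size()));
        flat.insert(flat.end(), s->begin(), s->end());
    }
    offsets.push_back(static_cast<int32_t>(flat.size()));

    // flat.data() and offsets.data() are then copied into two cl_mem
    // buffers (e.g. clCreateBuffer with CL_MEM_COPY_HOST_PTR), and the
    // kernel reads string i as flat[offsets[i]] .. flat[offsets[i+1]].
    return 0;
}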

Calculate the angle between two triangles in CUDA

我的未来我决定 submitted on 2021-01-29 06:08:49
Question: I want to calculate the angle between two triangles in 3D space. The two triangles will always share exactly two points, e.g. Triangle 1: Point1 (x1, y1, z1), Point2 (x2, y2, z2), Point3 (x3, y3, z3); Triangle 2: Point1 (x1, y1, z1), Point2 (x2, y2, z2), Point4 (x4, y4, z4). Is there a way to calculate the angle between them efficiently in CUDA? Answer 1: For each plane, you need to construct its normal vector (perpendicular to all lines in that plane). The simple way to do that is to take the…
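Following the answer's approach, each triangle's normal is the cross product of two edge vectors, and the angle comes from the normalized dot product of the two normals. A minimal device-side sketch (helper names are illustrative, not a library API):

#include <cstdio>
#include "cuda_runtime.h"

__device__ float3 sub3(float3 a, float3 b) { return make_float3(a.x - b.x, a.y - b.y, a.z - b.z); }
__device__ float dot3(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
__device__ float3 cross3(float3 a, float3 b) {
    return make_float3(a.y * b.z - a.z * b.y,
                       a.z * b.x - a.x * b.z,
                       a.x * b.y - a.y * b.x);
}

// Angle between the planes of triangles (p1,p2,p3) and (p1,p2,p4),
// which share the edge p1-p2.
__global__ void triangleAngle(float3 p1, float3 p2, float3 p3, float3 p4, float *angle) {
    float3 n1 = cross3(sub3(p2, p1), sub3(p3, p1));  // normal of triangle 1
    float3 n2 = cross3(sub3(p2, p1), sub3(p4, p1));  // normal of triangle 2
    float c = dot3(n1, n2) * rsqrtf(dot3(n1, n1) * dot3(n2, n2));
    *angle = acosf(fminf(fmaxf(c, -1.0f), 1.0f));    // clamp for numerical safety
}

int main() {
    float *angle_d;
    cudaMalloc(&angle_d, sizeof(float));
    triangleAngle<<<1, 1>>>(make_float3(0, 0, 0), make_float3(1, 0, 0),
                            make_float3(0, 1, 0), make_float3(0, 0, 1), angle_d);
    float angle;
    cudaMemcpy(&angle, angle_d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("angle = %f rad\n", angle);  // expect pi/2 for these points
    cudaFree(angle_d);
    return 0;
}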

PyTorch CUDA error: no kernel image is available for execution on the device on RTX 3090 with CUDA 11.1

岁酱吖の submitted on 2021-01-28 01:50:34
Question: If I run the following: import torch import sys print('A', sys.version) print('B', torch.__version__) print('C', torch.cuda.is_available()) print('D', torch.backends.cudnn.enabled) device = torch.device('cuda') print('E', torch.cuda.get_device_properties(device)) print('F', torch.tensor([1.0, 2.0]).cuda()) I get this: A 3.7.5 (default, Nov 7 2019, 10:50:52) [GCC 8.3.0] B 1.8.0.dev20210115+cu110 C True D True E _CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory…